{"version":"https://jsonfeed.org/version/1","title":"Data Engineering Podcast","home_page_url":"https://www.dataengineeringpodcast.com","feed_url":"https://www.dataengineeringpodcast.com/json","description":"This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.","_fireside":{"subtitle":"Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry","pubdate":"2024-04-21T17:00:00.000-04:00","explicit":false,"copyright":"2024 by Boundless Notions, LLC.","owner":"Tobias Macey","image":"https://assets.fireside.fm/file/fireside-images/podcasts/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/cover.jpg?v=1"},"items":[{"id":"2335d9ff-5fda-498e-a649-355de6c98444","title":"Making Email Better With AI At Shortwave","url":"https://www.dataengineeringpodcast.com/shortwave-ai-powered-email-episode-422","content_text":"Summary\n\nGenerative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Andrew Lee about his work on Shortwave, an AI powered email client\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Shortwave is and the story behind it?\n\n\nWhat is the core problem that you are addressing with Shortwave?\n\nEmail has been a central part of communication and business productivity for decades now. 
What are the overall themes that continue to be problematic?\nWhat are the strengths that email maintains as a protocol and ecosystem?\nFrom a product perspective, what are the data challenges that are posed by email?\nCan you describe how you have architected the Shortwave platform?\n\n\nHow have the design and goals of the product changed since you started it?\nWhat are the ways that the advent and evolution of language models have influenced your product roadmap?\n\nHow do you manage the personalization of the AI functionality in your system for each user/team?\nFor users and teams who are using Shortwave, how does it change their workflow and communication patterns?\nCan you describe how I would use Shortwave for managing the workflow of evaluating, planning, and promoting my podcast episodes?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Shortwave used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Shortwave?\nWhen is Shortwave the wrong choice?\nWhat do you have planned for the future of Shortwave?\n\n\nContact Info\n\n\nLinkedIn\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nShortwave\nFirebase\nGoogle Inbox\nHey\n\n\nEzra Klein Hey Article\n\nSuperhuman\nPinecone\n\n\nPodcast Episode\n\nElastic\nHybrid Search\nSemantic Search\nMistral\nGPT 3.5\nIMAP\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. 
<u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

\n\n

Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave, he was focused on making email more productive. When AI started gaining adoption, he realized that there was even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.","date_published":"2024-04-21T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2335d9ff-5fda-498e-a649-355de6c98444.mp3","mime_type":"audio/mpeg","size_in_bytes":27281389,"duration_in_seconds":3223}]},{"id":"44fbc75b-dc7e-4330-80b0-fac257506966","title":"Designing A Non-Relational Database Engine","url":"https://www.dataengineeringpodcast.com/non-relational-database-design-episode-421","content_text":"Summary\n\nDatabases come in a variety of formats for different use cases. The default association with the term \"database\" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what constitutes a NoSQL database?\n\n\nHow have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago?\n\nWhat are the factors that convince teams to use a NoSQL vs. SQL database?\n\n\nNoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus?\n\nHow have the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines?\nWhen designing and building a database, what are the initial set of questions that need to be answered?\n\n\nHow many \"core capabilities\" can you reasonably design around before they conflict with each other?\n\nHow have you approached the evolution of RavenDB as you add new capabilities and mature the project?\n\n\nWhat are some of the early decisions that had to be unwound to enable new capabilities?\n\nIf you were to start from scratch today, what database would you build?\nWhat are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB?\nWhen is a NoSQL database/RavenDB the wrong choice?\nWhat do you have planned for the future of RavenDB?\n\n\nContact Info\n\n\nBlog\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nRavenDB\nRSS\nObject Relational Mapper (ORM)\nRelational Database\nNoSQL\nCouchDB\nNavigational Database\nMongoDB\nRedis\nNeo4J\nCassandra\nColumn-Family\nSQLite\nLevelDB\nFirebird DB\nfsync\nEsent DB?\nKNN == K-Nearest Neighbors\nRocksDB\nC# Language\nASP.NET\nQUIC\nDynamo Paper\nDatabase Internals book (affiliate link)\nDesigning Data Intensive Applications book (affiliate link)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. 
\r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

\n\n

Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Databases come in a variety of formats for different use cases. The default association with the term \"database\" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.","date_published":"2024-04-14T12:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/44fbc75b-dc7e-4330-80b0-fac257506966.mp3","mime_type":"audio/mpeg","size_in_bytes":48043583,"duration_in_seconds":4561}]},{"id":"77a6c22f-bdfc-48ce-a762-aa94552e1887","title":"Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer","url":"https://www.dataengineeringpodcast.com/cube-semantic-layer-episode-420","content_text":"Summary\n\nMaintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. 
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by outlining the technical elements of what it means to have a \"semantic layer\"?\nIn the past couple of years there was a rapid hype cycle around the \"metrics layer\" and \"headless BI\", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts?\nWhat are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.)\n\n\nAt what point does it become necessary/beneficial for a team to adopt such a service?\nWhat are the challenges involved in retrofitting a semantic layer into a production data system?\n\nevolution of requirements/usage patterns\ntechnical complexities/performance and cost optimization\nWhat are the most interesting, innovative, or unexpected ways that you have seen Cube used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?\nWhen is Cube/a semantic layer the wrong choice?\nWhat do you have planned for the future of Cube?\n\n\nContact Info\n\n\nLinkedIn\nkeydunov on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nCube\nSemantic Layer\nBusiness Objects\nTableau\nLooker\n\n\nPodcast Episode\n\nMode\nThoughtspot\nLightDash\n\n\nPodcast Episode\n\nEmbedded Analytics\nDimensional Modeling\nClickhouse\n\n\nPodcast Episode\n\nDruid\nBigQuery\nStarburst\nPinot\nSnowflake\n\n\nPodcast Episode\n\nArrow Datafusion\nMetabase\n\n\nPodcast Episode\n\nSuperset\nAlation\nCollibra\n\n\nPodcast Episode\n\nAtlan\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. 
\r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

\n\n

Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.","date_published":"2024-04-07T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/77a6c22f-bdfc-48ce-a762-aa94552e1887.mp3","mime_type":"audio/mpeg","size_in_bytes":32270503,"duration_in_seconds":3383}]},{"id":"4be200d5-d131-48f9-bb4e-3368345f7f81","title":"Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary","url":"https://www.dataengineeringpodcast.com/elementary-data-dbt-observability-episode-419","content_text":"Summary\n\nWorking with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. 
Your first 30 days are free!\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.\nYour host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by outlining what elements of observability are most relevant for dbt projects?\nWhat are some of the common ad-hoc/DIY methods that teams develop to acquire those insights?\n\n\nWhat are the challenges/shortcomings associated with those approaches?\n\nOver the past ~3 years there were numerous data observability systems/products created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools?\n\n\nWhat are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle?\n\nCan you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects?\nHow is Elementary designed/implemented?\n\n\nHow have the scope and goals of the project changed since you started working on it?\nWhat are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary?\n\nCan you talk us through the setup and workflow for teams adopting Elementary in their dbt projects?\nHow does the incorporation of Elementary change the development habits of the teams who are using it?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Elementary used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary?\nWhen is Elementary the wrong choice?\nWhat do you have planned for the future of Elementary?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nElementary\nData Observability\ndbt\nDatadog\npre-commit\ndbt packages\nSQLMesh\nMalloy\nSDF\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

\n\n

Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all focus on different technologies and workflows. To bring observability to dbt projects, the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.","date_published":"2024-03-31T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4be200d5-d131-48f9-bb4e-3368345f7f81.mp3","mime_type":"audio/mpeg","size_in_bytes":33417125,"duration_in_seconds":3044}]},{"id":"7527e3ca-cb0b-4963-81f6-1849e41f3f8e","title":"Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+","url":"https://www.dataengineeringpodcast.com/dagster-plus-collaborative-data-orchestration-episode-418","content_text":"Summary\n\nA core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and orchestrate across language platforms\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the focus of Dagster+ is and the story behind it?\n\n\nWhat problems are you trying to solve with Dagster+?\nWhat are the notable enhancements beyond the Dagster Core project that this updated platform provides?\nHow is it different from the current Dagster Cloud product?\n\nIn the launch announcement you tease new capabilities that would be great to explore in turns:\n\n\nMake data a team sport, enabling data teams across the organization\nDeliver reliable, high quality data the organization can trust\nObserve and manage data platform costs\nMaster the heterogeneous collection of technologies—both traditional and Modern Data Stack\n\nWhat are the business/product goals that you are focused on improving with the launch of Dagster+\nWhat are the most interesting, innovative, or unexpected ways that you have seen Dagster used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+?\nWhen is Dagster+ the wrong choice?\nWhat do you have planned for the future of Dagster/Dagster Cloud/Dagster+?\n\n\nContact Info\n\n\nTwitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nDagster\n\n\nPodcast Episode\n\nDagster+ Launch Event\nHadoop\nMapReduce\nPydantic\nSoftware Defined Assets\nDagster Insights\nDagster Pipes\nConway's Law\nData Mesh\nDagster Code Locations\nDagster Asset Checks\nDave & Buster's\nSQLMesh\n\n\nPodcast Episode\n\nSDF\nMalloy\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. 
Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

\n\n

A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project, they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.","date_published":"2024-03-24T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7527e3ca-cb0b-4963-81f6-1849e41f3f8e.mp3","mime_type":"audio/mpeg","size_in_bytes":38239460,"duration_in_seconds":3339}]},{"id":"cd108f0a-aec9-42c3-864e-c1aad5fe95e8","title":"Reconciling The Data In Your Databases With Datafold","url":"https://www.dataengineeringpodcast.com/datafold-database-reconciliation-episode-417","content_text":"Summary\n\nA significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. 
Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!\nYour host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by outlining some of the situations where reconciling data between databases is needed?\nWhat are examples of the error conditions that you are likely to run into when duplicating information between database engines?\n\n\nWhen these errors do occur, what are some of the problems that they can cause?\n\nWhen teams are replicating data between database engines, what are some of the common patterns for managing those flows?\n\n\nHow does that change between continual and one-time replication?\n\nWhat are some of the steps involved in verifying the integrity of data replication between database engines?\nIf the source or destination isn't a traditional database engine (e.g. data lakehouse) how does that change the work involved in verifying the success of the replication?\nWhat are the challenges of validating and reconciling data?\n\n\nSheer scale and cost of pulling data out, have to do in-place\nPerformance. Pushing databases to the limit, especially hard for OLTP and legacy\nCross-database compatibilty\nData types\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Datafold/data-diff used in the context of cross-database validation?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?\nWhen is Datafold/data-diff the wrong choice?\nWhat do you have planned for the future of Datafold?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nDatafold\n\n\nPodcast Episode\n\ndata-diff\n\n\nPodcast Episode\n\nHive\nPresto\nSpark\nSAP HANA\nChange Data Capture\nNessie\n\n\nPodcast Episode\n\nLakeFS\n\n\nPodcast Episode\n\nIceberg Tables\n\n\nPodcast Episode\n\nSQLGlot\nTrino\nGitHub Copilot\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\n\r\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. 
Don't miss out on our only event this year! Visit <u>[dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council)</u> and use code **dataengpod20** to register today! Promo Code: dataengpod20","content_html":"

Summary

A significant portion of data workflows involves storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
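
A minimal sketch of the kind of cross-database validation discussed here, using Datafold's open-source data-diff library to compare the same table across two engines. The connection strings, table name, and key column below are placeholders, and the API may differ between data-diff versions.

```python
# Sketch: row-level comparison of one table across two databases with data-diff.
# All connection strings, the table name, and the key column are placeholders.
from data_diff import connect_to_table, diff_tables

source = connect_to_table("postgresql://user:pass@source-host/app_db", "orders", "id")
target = connect_to_table("snowflake://user:pass@account/analytics/public?warehouse=compute_wh", "orders", "id")

# diff_tables yields ("-", row) for rows present only in the source and
# ("+", row) for rows present only in the target; no output means they match.
for sign, row in diff_tables(source, target):
    print(sign, row)
```

An empty diff means the two tables agree on the chosen key column; in practice a check like this would run per table after replication or migration, before downstream consumers rely on the copy.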

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.","date_published":"2024-03-17T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cd108f0a-aec9-42c3-864e-c1aad5fe95e8.mp3","mime_type":"audio/mpeg","size_in_bytes":34965740,"duration_in_seconds":3494}]},{"id":"702c86bc-f9d8-48be-abe4-74749138b5f1","title":"Version Your Data Lakehouse Like Your Software With Nessie","url":"https://www.dataengineeringpodcast.com/nessie-data-lakehouse-data-versioning-episode-416","content_text":"Summary\n\nData lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. 
Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!\nYour host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'reilly, \"Apache Iceberg, The definitive Guide\", about Nessie, a git-like versioned catalog for data lakes using Apache Iceberg\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Nessie is and the story behind it?\nWhat are the core problems/complexities that Nessie is designed to solve?\nThe closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case?\nWhy would someone choose Nessie over native table-level branching in the Apache Iceberg spec?\nHow do the versioning capabilities compare to/augment the data versioning in Iceberg?\nWhat are some of the sources of, and challenges in resolving, merge conflicts between table branches?\nCan you describe the architecture of Nessie?\nHow have the design and goals of the project changed since it was first created?\nWhat is involved in integrating Nessie into a given data stack?\nFor cases where a given query/compute engine doesn't natively support Nessie, what are the options for using it effectively?\nHow does the inclusion of Nessie in a data lake influence the overall workflow of developing/deploying/evolving processing flows?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Nessie used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with Nessie?\nWhen is Nessie the wrong choice?\nWhat have you heard is planned for the future of Nessie?\n\n\nContact Info\n\n\nLinkedIn\nTwitter\nAlex's Article on Dremio's Blog\nAlex's Substack\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nProject Nessie\nArticle: What is Nessie, Catalog Versioning and Git-for-Data?\nArticle: What is Lakehouse Management?: Git-for-Data, Automated Apache Iceberg Table Maintenance and more\nFree Early Release Copy of \"Apache Iceberg: The Definitive Guide\"\nIceberg\n\n\nPodcast Episode\n\nArrow\n\n\nPodcast Episode\n\nData Lakehouse\nLakeFS\n\n\nPodcast Episode\n\nAWS Glue\nTabular\n\n\nPodcast Episode\n\nTrino\nPresto\nDremio\n\n\nPodcast Episode\n\nRocksDB\nDelta Lake\n\n\nPodcast Episode\n\nHive Metastore\nPyIceberg\nOptimistic Concurrency Control\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\n\r\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit <u>[dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council)</u> and use code **dataengpod20** to register today! Promo Code: dataengpod20Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. 
\r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

Data lakehouse architectures are gaining popularity due to the flexibility and cost-effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
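
A minimal sketch of what that Git-style workflow can look like from Spark, assuming a SparkSession that is already configured with a Nessie catalog (named nessie here) and the Nessie Spark SQL extensions; the branch, table, and staging view names are placeholders, and statement syntax may vary between Nessie versions.

```python
# Sketch: branch, write, and merge against a Nessie catalog from PySpark.
# Assumes the session is configured with the Nessie catalog and SQL extensions;
# branch, table, and view names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create an isolated branch for this ETL run, diverging from main.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_run IN nessie FROM main")
spark.sql("USE REFERENCE etl_run IN nessie")

# Writes made now are only visible on the etl_run branch.
spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM staging_orders")

# Once validated, publish all of the branch's changes atomically.
spark.sql("MERGE BRANCH etl_run INTO main IN nessie")
```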

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.","date_published":"2024-03-10T11:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/702c86bc-f9d8-48be-abe4-74749138b5f1.mp3","mime_type":"audio/mpeg","size_in_bytes":27672114,"duration_in_seconds":2455}]},{"id":"11bc48df-d3ee-4aa8-8881-16e0827cada4","title":"When And How To Conduct An AI Program","url":"https://www.dataengineeringpodcast.com/vast-data-ai-program-episode-415","content_text":"Summary\n\nArtificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. 
As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!\nYour host is Tobias Macey and today I'm interviewing Colleen Tartow about the questions to answer before and during the development of an AI program\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhen you say \"AI Program\", what are the organizational, technical, and strategic elements that it encompasses?\n\n\nHow does the idea of an \"AI Program\" differ from an \"AI Product\"?\nWhat are some of the signals to watch for that indicate an objective for which AI is not a reasonable solution?\n\nWho needs to be involved in the process of defining and developing that program?\n\n\nWhat are the skills and systems that need to be in place to effectively execute on an AI program?\n\n\"AI\" has grown to be an even more overloaded term than it already was. What are some of the useful clarifying/scoping questions to address when deciding the path to deployment for different definitions of \"AI\"?\nOrganizations can easily fall into the trap of green-lighting an AI project before they have done the work of ensuring they have the necessary data and the ability to process it. What are the steps to take to build confidence in the availability of the data?\n\n\nEven if you are sure that you can get the data, what are the implementation pitfalls that teams should be wary of while building out the data flows for powering the AI system?\nWhat are the key considerations for powering AI applications that are substantially different from analytical applications?\n\nThe ecosystem for ML/AI is a rapidly moving target. What are the foundational/fundamental principles that you need to design around to allow for future flexibility?\nWhat are the most interesting, innovative, or unexpected ways that you have seen AI programs implemented?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on powering AI systems?\nWhen is AI the wrong choice?\nWhat do you have planned for the future of your work at VAST Data?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nVAST Data\nColleen's Previous Appearance\nLinear Regression\nCoreWeave\nLambda Labs\nMAD Landscape\n\n\nPodcast Episode\nML Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. 
However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\n\r\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit <u>[dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council)</u> and use code **dataengpod20** to register today! Promo Code: dataengpod20Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>","content_html":"

Summary

Artificial intelligence technologies promise to revolutionize business and produce new sources of value. Making those promises a reality requires a substantial amount of strategy and investment. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.","date_published":"2024-03-03T09:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/11bc48df-d3ee-4aa8-8881-16e0827cada4.mp3","mime_type":"audio/mpeg","size_in_bytes":28819727,"duration_in_seconds":2785}]},{"id":"8e69099e-4d0c-4dae-a085-be14299c780f","title":"Find Out About The Technology Behind The Latest PFAD In Analytical Database Development","url":"https://www.dataengineeringpodcast.com/influxdb-fdap-database-stack-episode-414","content_text":"Summary\n\nBuilding a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. 
Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!\nYour host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?\n\n\nThis was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?\n\nEach of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components?\nOne of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform?\nCan you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack?\n\n\nWhat are the elements of the overall product/user experience that you had to build to create a cohesive platform?\n\nWhat are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack?\nWhat are the pieces of the Arrow ecosystem that are still immature or need further investment from the community?\nWhat are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack?\nWhen is the FDAP stack the wrong choice?\nWhat do you have planned for the future of the InfluxDB IOx engine and the FDAP stack?\n\n\nContact Info\n\n\nLinkedIn\npauldix on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nFDAP Stack Blog Post\nApache Arrow\nDataFusion\nArrow Flight\nApache Parquet\nInfluxDB\nInflux Data\n\n\nPodcast Episode\n\nRust Language\nDuckDB\nClickHouse\nVoltron Data\n\n\nPodcast Episode\n\nVelox\nIceberg\n\n\nPodcast Episode\n\nTrino\nODBC == Open DataBase Connectivity\nGeoParquet\nORC == Optimized Row Columnar\nAvro\nProtocol Buffers\ngRPC\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\n\r\nJoin us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit <u>[dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council)</u> and use code **dataengpod20** to register today! Promo Code: dataengpod20Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. 
\r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

Building a database engine requires a substantial amount of engineering effort and time investment. Over decades of research and development into these software systems, a number of common components have emerged that are shared across implementations. When Paul Dix decided to rewrite the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, DataFusion, and Parquet to lay the foundation of the newest version of his time-series database.
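
A minimal sketch of two of those building blocks, using Arrow as the in-memory representation and Parquet as the on-disk columnar format, with a DataFusion query over the same file; the file name, columns, and values are illustrative only, and Flight is not shown.

```python
# Sketch: Arrow in memory, Parquet on disk, DataFusion for SQL on top.
# File name, column names, and values are illustrative placeholders.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
from datafusion import SessionContext

# Build an Arrow table in memory (the shared columnar representation).
table = pa.table({
    "ts": pa.array([1, 2, 3], type=pa.int64()),
    "cpu": pa.array([0.3, 0.7, 0.5]),
})

# Persist it as Parquet and read it back without leaving the Arrow world.
pq.write_table(table, "metrics.parquet")
loaded = pq.read_table("metrics.parquet")
print(pc.mean(loaded["cpu"]))

# DataFusion can then plan and execute SQL over the same Parquet file.
ctx = SessionContext()
ctx.register_parquet("metrics", "metrics.parquet")
print(ctx.sql("SELECT avg(cpu) AS avg_cpu FROM metrics").collect())
```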

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.","date_published":"2024-02-25T13:15:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8e69099e-4d0c-4dae-a085-be14299c780f.mp3","mime_type":"audio/mpeg","size_in_bytes":37978822,"duration_in_seconds":3360}]},{"id":"d719c362-b99b-4fc1-aacf-e68f75a973be","title":"Using Trino And Iceberg As The Foundation Of Your Data Lakehouse","url":"https://www.dataengineeringpodcast.com/starburst-trino-iceberg-data-lakehouse-episode-413","content_text":"Summary\n\nA data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nJoin in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. 
Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working togethr to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today.\nYour host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nTo start, can you share your definition of what constitutes a \"Data Lakehouse\"?\n\n\nWhat are the technical/architectural/UX challenges that have hindered the progression of lakehouses?\nWhat are the notable advancements in recent months/years that make them a more viable platform choice?\n\nThere are multiple tools and vendors that have adopted the \"data lakehouse\" terminology. What are the benefits offered by the combination of Trino and Iceberg?\n\n\nWhat are the key points of comparison for that combination in relation to other possible selections?\n\nWhat are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?\n\n\nWhat progress is being made (within or across the ecosystem) to address those sharp edges?\n\nFor someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?\nWhat are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem?\nWhen is a lakehouse the wrong choice?\nWhat do you have planned for the future of Trino/Starburst?\n\n\nContact Info\n\n\nLinkedIn\ndain on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nTrino\nStarburst\nPresto\nJBoss\nJava EE\nHDFS\nS3\nGCS == Google Cloud Storage\nHive\nHive ACID\nApache Ranger\nOPA == Open Policy Agent\nOso\nAWS Lakeformation\nTabular\nIceberg\n\n\nPodcast Episode\n\nDelta Lake\n\n\nPodcast Episode\n\nDebezium\n\n\nPodcast Episode\n\nMaterialized View\nClickhouse\nDruid\nHudi\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\n\r\nJoin us at the top event for the global data community, Data Council Austin. 
From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit <u>[dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council)</u> and use code **dataengpod20** to register today! Promo Code: dataengpod20Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
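
A minimal sketch of querying an Iceberg table through Trino from Python, assuming a reachable Trino coordinator with an Iceberg catalog named iceberg; the host, user, schema, and table names are placeholders.

```python
# Sketch: reading an Iceberg table through Trino with the trino Python client.
# Host, user, catalog, schema, and table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="localhost",   # Trino coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",  # an Iceberg catalog configured in Trino
    schema="analytics",
)
cur = conn.cursor()

# Iceberg tables read like any other SQL table; snapshots, schema evolution,
# and file layout are handled by the table format underneath.
cur.execute(
    "SELECT order_date, count(*) AS order_count "
    "FROM orders GROUP BY order_date ORDER BY order_date LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```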

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.","date_published":"2024-02-18T15:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d719c362-b99b-4fc1-aacf-e68f75a973be.mp3","mime_type":"audio/mpeg","size_in_bytes":32189267,"duration_in_seconds":3526}]},{"id":"e0ccc619-dec6-46f8-a4a2-d4b3b9bea8fc","title":"Data Sharing Across Business And Platform Boundaries","url":"https://www.dataengineeringpodcast.com/bobsled-data-sharing-platform-episode-412","content_text":"Summary\n\nSharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. 
Your first 30 days are free!\nYour host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving some context and scope of what we mean by \"data sharing\" for the purposes of this conversation?\nWhat is the current state of the ecosystem for data sharing protocols/practices/platforms?\n\n\nWhat are some of the main challenges/shortcomings that teams/organizations experience with these options?\n\nWhat are the technical capabilities that need to be present for an effective data sharing solution?\n\n\nHow does that change as a function of the type of data? (e.g. tabular, image, etc.)\n\nWhat are the requirements around governance and auditability of data access that need to be addressed when sharing data?\nWhat are the typical boundaries along which data access requires special consideration for how the sharing is managed?\nMany data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?\nWhen is Bobsled the wrong choice?\nWhat do you have planned for the future of data sharing?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nBobsled\nOLAP == OnLine Analytical Processing\nCassandra\n\n\nPodcast Episode\n\nNeo4J\nFTP == File Transfer Protocol\nS3 Access Points\nSnowflake Sharing\nBigQuery Sharing\nDatabricks Delta Sharing\nDuckDB\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. 
Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"

Summary

Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations, particularly when the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.","date_published":"2024-02-11T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e0ccc619-dec6-46f8-a4a2-d4b3b9bea8fc.mp3","mime_type":"audio/mpeg","size_in_bytes":33036920,"duration_in_seconds":3595}]},{"id":"2724ef53-ecf5-4215-8578-339724d72f25","title":"Tackling Real Time Streaming Data With SQL Using RisingWave","url":"https://www.dataengineeringpodcast.com/risingwave-streaming-database-engine-episode-411","content_text":"Summary\n\nStream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nYour host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what RisingWave is and the story behind it?\nThere are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. 
What is the specific niche that RisingWave addresses?\n\n\nWhat are some of the platforms/architectures that teams are replacing with RisingWave?\n\nWhat are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?\nCan you describe how RisingWave is architected and implemented?\n\n\nHow have the design and goals/scope changed since you first started working on it?\nWhat are the core design philosophies that you rely on to prioritize the ongoing development of the project?\n\nWhat are the most complex engineering challenges that you have had to address in the creation of RisingWave?\nCan you describe a typical workflow for teams that are building on top of RisingWave?\n\n\nWhat are the user/developer experience elements that you have prioritized most highly?\n\nWhat are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?\nWhat are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?\nWhen is RisingWave the wrong choice?\nWhat do you have planned for the future of RisingWave?\n\n\nContact Info\n\n\nyingjunwu on GitHub\nPersonal Website\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nRisingWave\nAWS Redshift\nFlink\n\n\nPodcast Episode\n\nClickhouse\n\n\nPodcast Episode\n\nDruid\nMaterialize\nSpark\nTrino\nSnowflake\nKafka\nIceberg\n\n\nPodcast Episode\n\nHudi\n\n\nPodcast Episode\n\nPostgres\nDebezium\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. 
Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>","content_html":"

Summary

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.","date_published":"2024-02-04T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2724ef53-ecf5-4215-8578-339724d72f25.mp3","mime_type":"audio/mpeg","size_in_bytes":28223296,"duration_in_seconds":3415}]},{"id":"94874f82-4323-4107-88fd-29971460e9a1","title":"Build A Data Lake For Your Security Logs With Scanner","url":"https://www.dataengineeringpodcast.com/scanner-security-data-lake-episode-410","content_text":"Summary\n\nMonitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Scanner is and the story behind it?\n\n\nWhat were the shortcomings of other tools that are available in the ecosystem?\n\nWhat is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)\nA query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with?- e.g. cloudtrail logs, app logs, etc.\n\n\nWhat are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?\n\nLog data is notoriously messy, with no strictly defined format. 
How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?\nCan you describe the architecture of the Scanner platform?\n\n\nWhat were the motivating constraints that led you to your current implementation?\nHow have the design and goals of the product changed since you first started working on it?\n\nGiven the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?\nWhat are the personas of the end-users for Scanner?\n\n\nHow has that influenced the way that you think about the query formats, APIs, user experience etc. for the prroduct?\n\nFor teams who are working with Scanner can you describe how it fits into their workflow?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Scanner used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?\nWhen is Scanner the wrong choice?\nWhat do you have planned for the future of Scanner?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nScanner\ncURL\nRust\nSplunk\nS3\nAWS Athena\nLoki\nSnowflake\n\n\nPodcast Episode\n\nPresto\n[Trino](thttps://trino.io/)\nAWS CloudTrail\nGitHub Audit Logs\nOkta\nCribl\nVector.dev\nTines\nTorq\nJira\nLinear\nECS Fargate\nSQS\nMonoid\nGroup Theory\nAvro\nParquet\nOCSF\nVPC Flow Logs\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>","content_html":"

Summary

Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.","date_published":"2024-01-28T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/94874f82-4323-4107-88fd-29971460e9a1.mp3","mime_type":"audio/mpeg","size_in_bytes":35613954,"duration_in_seconds":3758}]},{"id":"ab7ae66f-3e2d-4f9f-a517-fe92de02a17f","title":"Modern Customer Data Platform Principles","url":"https://www.dataengineeringpodcast.com/actioniq-modern-customer-data-platform-episode-409","content_text":"Summary\n\nDatabases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. 
That’s three free boards at dataengineeringpodcast.com/miro.\nYour host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the role of the CDP is in the context of a businesses data ecosystem?\n\n\nWhat are the core technical challenges associated with building and maintaining a CDP?\nWhat are the organizational/business factors that contribute to the complexity of these systems?\n\nThe early days of CDPs came with the promise of \"Customer 360\". Can you unpack that concept and how it has changed over the past ~5 years?\nRecent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs?\n\n\nHow have the architectural shifts changed the ways that organizations interact with their customer data?\n\nHow have the responsibilities shifted across different roles?\n\n\nWhat are the governance policy and enforcement challenges that are added with the expansion of access and responsibility?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs?\nWhen is a CDP the wrong choice?\nWhat do you have planned for the future of ActionIQ?\n\n\nContact Info\n\n\nLinkedIn\n@Tasso on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nAction IQ\nAster Data\nTeradata\nFilemaker\nHadoop\nNoSQL\nHive\nInformix\nParquet\nSnowflake\n\n\nPodcast Episode\n\nSpark\nRedshift\nUnity Catalog\nCustomer Data Platform\nCDP Market Guide\nKaizen\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. 
Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png)\r\n\r\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at <u>[dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).</u>","content_html":"

Summary

Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).","date_published":"2024-01-21T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ab7ae66f-3e2d-4f9f-a517-fe92de02a17f.mp3","mime_type":"audio/mpeg","size_in_bytes":30070279,"duration_in_seconds":3693}]},{"id":"551d2efd-666e-4cc0-a392-4668042cc806","title":"Pushing The Limits Of Scalability And User Experience For Data Processing WIth Jignesh Patel","url":"https://www.dataengineeringpodcast.com/jignesh-patel-scalability-and-ux-research-episode-408","content_text":"Summary\n\nData processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by summarizing your current areas of research and the motivations behind them?\nWhat are the open questions today in technical scalability of data engines?\n\n\nWhat are the experimental methods that you are using to gain understanding in the opportunities and practical limits of those systems?\n\nAs you strive to push the limits of technical capacity in data systems, how does that impact the usability of the resulting systems?\n\n\nWhen performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product?\n\nWhat are the main sources of tension between technical scalability and user experience/ease of comprehension?\nWhat are some of the positive synergies that you have been able to realize between your teaching, research, and corporate activities?\n\n\nIn what ways do they produce conflict, whether personally or technically?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen your research used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on research of the scalability limits of data systems?\nWhat is your heuristic for when a given research project needs to be terminated or productionized?\nWhat do you have planned for the future of your academic research?\n\n\nContact Info\n\n\nWebsite\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nCarnegie Mellon Universe\nParallel Databases\nGenomics\nProteomics\nMoore's Law\nDennard Scaling\nGenerative AI\nQuantum Computing\nVoltron Data\n\n\nPodcast Episode\n\nVon Neumann Architecture\nTwo's Complement\nOttertune\n\n\nPodcast Episode\n\ndbt\nInformatica\nMozart Data\n\n\nPodcast Episode\n\nDataChat\nVon Neumann Bottleneck\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. 
Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>","content_html":"

Summary

Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.","date_published":"2024-01-07T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/551d2efd-666e-4cc0-a392-4668042cc806.mp3","mime_type":"audio/mpeg","size_in_bytes":28853822,"duration_in_seconds":3026}]},{"id":"ed8dacce-1cf2-434b-8afd-ce1ca1a841d0","title":"Designing Data Platforms For Fintech Companies","url":"https://www.dataengineeringpodcast.com/fintech-data-engineering-episode-407","content_text":"Summary\n\nWorking with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. 
Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nYour host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by summarizing the data challenges that are particular to the fintech ecosystem?\nWhat are the primary sources and types of data that fintech organizations are working with?\n\n\nWhat are the business-level capabilities that are dependent on this data?\n\nHow do the regulatory and business requirements influence the technology landscape in fintech organizations?\n\n\nWhat does a typical build vs. buy decision process look like?\n\nFraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech?\n\n\nHow does that influence the architectural design/capabilities for data platforms in those organizations?\n\nData governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens?\nWhat are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech?\nWhat do you have planned for the future of your data capabilities at Monite?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nMonite\nISO 270001\nTesseract\nGitOps\nSWIFT Protocol\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. 
\r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!","content_html":"

Summary

Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchak, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.","date_published":"2023-12-31T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ed8dacce-1cf2-434b-8afd-ce1ca1a841d0.mp3","mime_type":"audio/mpeg","size_in_bytes":25890257,"duration_in_seconds":2876}]},{"id":"0fd4f377-ec60-4f29-ae3e-55f59818a7fa","title":"Troubleshooting Kafka In Production","url":"https://www.dataengineeringpodcast.com/kafka-troubleshooting-in-production-episode-406","content_text":"Summary\n\nKafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book \"Kafka: : Troubleshooting in Production\". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe your experiences with Kafka?\n\n\nWhat are the operational challenges that you have had to overcome while working with Kafka?\nWhat motivated to write a book about how to manage Kafka in production?\n\nThere are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?\n\n\nIn the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?\n\nWhen provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?\n\n\nWhat are the axes along which size/scale need to be determined?\n\nThe core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?\n\n\nUnder what circumstances can data be lost?\n\nWhat are the different failure conditions that cluster operators need to be aware of?\n\n\nWhat are the monitoring strategies that are most helpful for identifying (proactively or reactively) those errors?\n\nIn the event of these different cluster errors, what are the strategies for mitigating and recovering from those failures?\nWhen a cluster's usage expands beyond the original designed capacity, what are the options/procedures for expanding that capacity?\n\n\nWhen a cluster is underutilized, how can it be scaled down to reduce cost?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Kafka used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with Kafka?\nWhen is Kafka the wrong choice?\nWhat are the changes that you would like to see in Kafka to make it easier to operate?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nKafka: Troubleshooting in Production book (affiliate link)\nIronSource\nDruid\nTrino\nKafka\nSpark\nSRE == Site Reliability Engineer\nPresto\nSystem Performance by Brendan Gregg (affiliate link)\nHortonWorks\nRAID == Redundant Array of Inexpensive Disks\nJBOD == Just a Bunch Of Disks\nAWS MSK\nConfluent\nAiven\nJStat\nKafka Tiered Storage\nBrendan Gregg iostat utilization explanation\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!","content_html":"

Summary

Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book \"Kafka: Troubleshooting in Production\". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.","date_published":"2023-12-24T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0fd4f377-ec60-4f29-ae3e-55f59818a7fa.mp3","mime_type":"audio/mpeg","size_in_bytes":44819535,"duration_in_seconds":4483}]},{"id":"7f391746-a0ea-4c03-a4aa-4b7db18f995d","title":"Adding An Easy Mode For The Modern Data Stack With 5X","url":"https://www.dataengineeringpodcast.com/5x-integrated-modern-data-stack-episode-405","content_text":"Summary\n\nThe \"modern data stack\" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. 
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm welcoming back Tarush Aggarwal to talk about what he and his team at 5x data are building to improve the user experience of the modern data stack.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what 5x is and the story behind it?\n\n\nWe last spoke in March of 2022. What are the notable changes in the 5x business and product?\n\nWhat are the notable shifts in the data ecosystem that have influenced your adoption and product direction?\n\n\nWhat trends are you most focused on tracking as you plan the continued evolution of your offerings?\n\nWhat are the points of friction that teams run into when trying to build their data platform?\nCan you describe design of the system that you have built?\n\n\nWhat are the strategies that you rely on to support adaptability and speed of onboarding for new integrations?\n\nWhat are some of the types of edge cases that you have to deal with while integrating and operating the platform implementations that you design for your customers?\nWhat is your process for selection of vendors to support?\n\n\nHow would you characterize your relationships with the vendors that you rely on?\n\nFor customers who have pre-existing investment in a portion of the data stack, what is your process for engaging with them to understand how best to support their goals?\nWhat are the most interesting, innovative, or unexpected ways that you have seen 5XData used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on 5XData?\nWhen is 5X the wrong choice?\nWhat do you have planned for the future of 5X?\n\n\nContact Info\n\n\nLinkedIn\n@tarush on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\n5X\nInformatica\nSnowflake\n\n\nPodcast Episode\n\nLooker\n\n\nPodcast Episode\n\nDuckDB\n\n\nPodcast Episode\n\nRedshift\nReverse ETL\nFivetran\n\n\nPodcast Episode\n\nRudderstack\n\n\nPodcast Episode\n\nPeak.ai\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. 
Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!","content_html":"

Summary

The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understands the pain involved and the barriers to productivity, and set out to solve them by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The \"modern data stack\" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.","date_published":"2023-12-17T21:15:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7f391746-a0ea-4c03-a4aa-4b7db18f995d.mp3","mime_type":"audio/mpeg","size_in_bytes":38545803,"duration_in_seconds":3372}]},{"id":"db3c8fb4-1139-4111-bc36-57045d6aee1d","title":"Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack","url":"https://www.dataengineeringpodcast.com/anomstack-open-source-business-metric-anomaly-detection-episode-404","content_text":"Summary\n\nIf your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. 
I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Anomstack is and the story behind it?\n\n\nWhat are your goals for this project?\nWhat other tools/products might teams be evaluating while they consider Anomstack?\n\nIn the context of Anomstack, what constitutes a \"metric\"?\n\n\nWhat are some examples of useful metrics that a data team might want to monitor?\n\nYou put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project?\nWhat are the core capabilities and constraints that you selected to provide the focus and architecture of the project?\nCan you describe how Anomstack is implemented?\n\n\nHow have the design and goals of the project changed since you first started working on it?\n\nWhat are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform?\n\n\nWhat are the sharp edges that are still present in the system?\n\nWhat are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Anomstack used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack?\nWhen is Anomstack the wrong choice?\nWhat do you have planned for the future of Anomstack?\n\n\nContact Info\n\n\nLinkedIn\nTwitter\nGitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nAnomstack Github repo\nAirflow Anomaly Detection Provider Github repo\nNetdata\nMetric Tree\nSemantic Layer\nPrometheus\nAnodot\nChaos Genius\nMetaplane\nAnomalo\nPyOD\nAirflow\nDuckDB\nAnomstack Gallery\nDagster\nInfluxDB\nTimeGPT\nProphet\nGreyKite\nOpenLineage\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png)\r\n\r\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. 
Your first three Miro boards are free when you sign up today at <u>[dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).</u>Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!","content_html":"

Summary

If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.","date_published":"2023-12-10T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/db3c8fb4-1139-4111-bc36-57045d6aee1d.mp3","mime_type":"audio/mpeg","size_in_bytes":33457205,"duration_in_seconds":3077}]},{"id":"048c1e82-10bb-4d09-ab49-7112621b5b28","title":"Designing Data Transfer Systems That Scale","url":"https://www.dataengineeringpodcast.com/doublecloud-data-transfer-episode-403","content_text":"Summary\n\nThe first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his careeer to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. 
Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high bandwidth and low-latency change-data capture\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYour most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?\n\n\nWhat were the shortcomings of other options in the ecosystem that led you to building a new system?\n\nWhat was the design of your initial solution to the problem?\n\n\nWhat are the sharp edges that you had to deal with to operate and use that initial implementation?\n\nWhat were the limitations of the system as you started to scale it?\nCan you describe the current architecture of your data transfer platform?\n\n\nWhat are the capabilities and constraints that you are optimizing for?\n\nAs you move beyond the initial use case that started you down this path, what are the complexities involved in generalizing to add new functionality or integrate with additional platforms?\nWhat are the most interesting, innovative, or unexpected ways that you have seen your data transfer service used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the data transfer system?\nWhen is DoubleCloud Data Transfer the wrong choice?\nWhat do you have planned for the future of DoubleCloud Data Transfer?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nDoubleCloud\nKafka\nMapReduce\nChange Data Capture\nClickhouse\n\n\nPodcast Episode\n\nIceberg\n\n\nPodcast Episode\n\nDelta Lake\n\n\nPodcast Episode\n\ndbt\nOpenMetadata\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\nSpeaker - Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience and for the last 4 years was working on distributed systems with a focus on data delivery systems.Sponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting <u>[dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold)</u> today!","content_html":"

Summary

The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Speaker - Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience and for the last 4 years has been working on distributed systems with a focus on data delivery systems.

Sponsored By:

","summary":"The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his careeer to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.","date_published":"2023-12-04T00:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/048c1e82-10bb-4d09-ab49-7112621b5b28.mp3","mime_type":"audio/mpeg","size_in_bytes":37938309,"duration_in_seconds":3837}]},{"id":"4666a5a3-5ccd-4b25-a0e5-22d6d0d47df6","title":"Addressing The Challenges Of Component Integration In Data Platform Architectures","url":"https://www.dataengineeringpodcast.com/data-platform-architecture-component-integration-episode-402","content_text":"Summary\n\nBuilding a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nDeveloping event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis today to get started!\nData lakes are notoriously complex. 
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\ndata sharing\nweight of history\n\n\nexisting integrations with dbt\nswitching cost for e.g. SQLMesh\nde facto standard of Airflow\n\nSingle source of truth\n\n\npermissions management across application layers\nDatabase engine\nStorage layer in a lakehouse\nPresentation/access layer (BI)\nData flows\ndbt -> table level lineage\norchestration engine -> pipeline flows\n\n\ntask based vs. asset based\n\nMetadata platform as the logical place for horizontal view\n\n\n\nContact Info\n\n\nLinkedIn\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nMonologue Episode On Data Platform Design\nMonologue Episode On Leaky Abstractions\nAirbyte\n\n\nPodcast Episode\n\nTrino\nDagster\ndbt\nSnowflake\nBigQuery\nOpenMetadata\nOpenLineage\nData Platform Shadow IT Episode\nPreset\nLightDash\n\n\nPodcast Episode\n\nSQLMesh\n\n\nPodcast Episode\n\nAirflow\nSpark\nFlink\nTabular\nIceberg\nOpen Policy Agent\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Memphis: ![Memphis Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/dYE97ze8.png)\r\n\r\nDeveloping event-driven pipelines is going to be a lot easier - Meet Functions!\r\n\r\nMemphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. 
Go to <u>[dataengineeringpodcast.com/memphis](https://www.dataengineeringpodcast.com/memphis)</u> today to get started!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!","content_html":"

Summary

Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.","date_published":"2023-11-26T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4666a5a3-5ccd-4b25-a0e5-22d6d0d47df6.mp3","mime_type":"audio/mpeg","size_in_bytes":21097846,"duration_in_seconds":1782}]},{"id":"cb4150d5-ca91-4980-96d3-74392f7ec559","title":"Unlocking Your dbt Projects With Practical Advice For Practitioners","url":"https://www.dataengineeringpodcast.com/practical-dbt-projects-episode-401","content_text":"Summary\n\nThe dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nData lakes are notoriously complex. 
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your dbt projects\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was your path to adoption of dbt?\n\n\nWhat did you use prior to its existence?\nWhen/why/how did you start using it?\n\nWhat are some of the common challenges that teams experience when getting started with dbt?\n\n\nHow does prior experience in analytics and/or software engineering impact those outcomes?\n\nYou recently wrote a book to give a crash course in best practices for dbt. What motivated you to invest that time and effort?\n\n\nWhat new lessons did you learn about dbt in the process of writing the book?\n\nThe introduction of dbt is largely responsible for catalyzing the growth of \"analytics engineering\". As practitioners in the space, what do you see as the net result of that trend?\n\n\nWhat are the lessons that we all need to invest in independent of the tool?\n\nFor someone starting a new dbt project today, can you talk through the decisions that will be most critical for ensuring future success?\nAs dbt projects scale, what are the elements of technical debt that are most likely to slow down engineers?\n\n\nWhat are the capabilities in the dbt framework that can be used to mitigate the effects of that debt?\nWhat tools or processes outside of dbt can help alleviate the incidental complexity of a large dbt project?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen dbt used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with dbt? (as engineers and/or as autors)\nWhat is on your personal wish-list for the future of dbt (or its competition?)?\n\n\nContact Info\n\n\nDustin\n\n\nLinkedIn\n\nCameron\n\n\nLinkedIn\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nBiobot Analytic\nBreezeway\ndbt\n\n\nPodcast Episode\n\nSynapse Analytics\nSnowflake\n\n\nPodcast Episode\n\nFivetran\n\n\nPodcast Episode\n\nAnalytics Power Hour\nDDL == Data Definition Language\nDML == Data Manipulation Language\ndbt codegen\nUnlocking dbt book (affiliate link)\ndbt Mesh\ndbt Semantic Layer\nGitHub Actions\nMetaplane\n\n\nPodcast Episode\n\nDataTune Conference\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png)\r\n\r\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at <u>[dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).</u>Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. 
Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!","content_html":"

Summary

The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.","date_published":"2023-11-19T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cb4150d5-ca91-4980-96d3-74392f7ec559.mp3","mime_type":"audio/mpeg","size_in_bytes":51595209,"duration_in_seconds":4564}]},{"id":"240100bc-a947-42ff-85ba-5afe4a26b752","title":"Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine","url":"https://www.dataengineeringpodcast.com/tabnine-generative-ai-developer-assistant-episode-400","content_text":"Summary\n\nSoftware development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nYour host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in machine learning?\nCan you describe what Tabnine is and the story behind it?\nWhat are the individual and organizational motivations for using AI to generate code?\n\n\nWhat are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)\nWhat are the elements of skepticism/oversight that developers need to exercise while using a system like Tabnine?\n\nWhat are some of the primary ways that developers interact with Tabnine during their development workflow?\n\n\nAre there any particular styles of software for which an AI is more appropriate/capable? (e.g. webapps vs. data pipelines vs. exploratory analysis, etc.)\n\nFor natural languages there is a strong bias toward English in the current generation of LLMs. How does that translate into computer languages? (e.g. Python, Java, C++, etc.)\nCan you describe the structure and implementation of Tabnine?\n\n\nDo you rely primarily on a single core model, or do you have multiple models with subspecialization?\nHow have the design and goals of the product changed since you first started working on it?\n\nWhat are the biggest challenges in building a custom LLM for code?\n\n\nWhat are the opportunities for specialization of the model architecture given the highly structured nature of the problem domain?\n\nFor users of Tabnine, how do you assess/monitor the accuracy of recommendations?\n\n\nWhat are the feedback and reinforcement mechanisms for the model(s)?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Tabnine's LLM powered coding assistant used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on AI assisted development at Tabnine?\nWhen is an AI developer assistant the wrong choice?\nWhat do you have planned for the future of Tabnine?\n\n\nContact Info\n\n\nLinkedIn\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest barrier to adoption of machine learning today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nTabNine\nTechnion University\nProgram Synthesis\nContext Stuffing\nElixir\nDependency Injection\nCOBOL\nVerilog\nMidJourney\n\n\nThe intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0Sponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI-powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product, the ways that it enhances the ability of humans to get their work done, and the situations where the humans have to adapt to the tool.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Sponsored By:

","summary":"Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.","date_published":"2023-11-12T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/240100bc-a947-42ff-85ba-5afe4a26b752.mp3","mime_type":"audio/mpeg","size_in_bytes":32430435,"duration_in_seconds":4072}]},{"id":"9acc012e-4ea9-4f27-a638-db4443102b26","title":"Shining Some Light In The Black Box Of PostgreSQL Performance","url":"https://www.dataengineeringpodcast.com/pganalyze-postgresql-performance-tuning-episode-399","content_text":"Summary\n\nDatabases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solution of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. 
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYour host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the different ways that database performance problems impact the business?\nWhat are the most common contributors to performance issues?\nWhat are the useful signals that indicate performance challenges in the database?\n\n\nFor a given symptom, what are the steps that you recommend for determining the proximate cause?\n\nWhat are the potential negative impacts to be aware of when tuning the configuration of your database?\nHow does the database engine influence the methods used to identify and resolve performance challenges?\nMost of the database engines that are in common use today have been around for decades. How have the lessons learned from running these systems over the years influenced the ways to think about designing new engines or evolving the ones we have today?\nWhat are the most interesting, innovative, or unexpected ways that you have seen to address database performance?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on databases?\nWhat are your goals for the future of database engines?\n\n\nContact Info\n\n\nLinkedIn\n@LukasFittl on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nPGAnalyze\nCitus Data\n\n\nPodcast Episode\n\nORM == Object Relational Mapper\nN+1 Query\nAutovacuum\nWrite-ahead Log\npg_stat_io\nrandom_page_cost\npgvector\nVector Database\nOttertune\n\n\nPodcast Episode\n\nCitus Extension\nHydra\nClickhouse\n\n\nPodcast Episode\n\nMyISAM\nMyRocks\nInnoDB\nGreat Expectations\n\n\nPodcast Episode\n\nOpenTelemetry\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. <u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solution of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.","date_published":"2023-11-05T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9acc012e-4ea9-4f27-a638-db4443102b26.mp3","mime_type":"audio/mpeg","size_in_bytes":38953439,"duration_in_seconds":3291}]},{"id":"921b519b-7ae6-43b6-8df0-39081bc9712d","title":"Surveying The Market Of Database Products","url":"https://www.dataengineeringpodcast.com/database-products-market-survey-episode-398","content_text":"Summary\n\nDatabases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nData projects are notoriously complex. 
With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.\nYour host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the aspects of the database market that keep you interested as a VP of product?\n\n\nHow have your experiences at Elastic informed your current work at Clickhouse?\n\nWhat are the main product categories for databases today?\n\n\nWhat are the industry trends that have the most impact on the development and growth of different product categories?\nWhich categories do you see growing the fastest?\n\nWhen a team is selecting a database technology for a given task, what are the types of questions that they should be asking?\nTransactional engines like Postgres, SQL Server, Oracle, etc. were long used as analytical databases as well. What is driving the broad adoption of columnar stores as a separate environment from transactional systems?\n\n\nWhat are the inefficiencies/complexities that this introduces?\nHow can the database engine used for analytical systems work more closely with the transactional systems?\n\nWhen building analytical systems there are numerous moving parts with intricate dependencies. What is the role of the database in simplifying observability of these applications?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Clickhouse used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on database products?\nWhat are your prodictions for the future of the database market?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nClickhouse\n\n\nPodcast Episode\n\nElastic\nOLAP\nOLTP\nGraph Database\nVector Database\nTrino\nPresto\nForeign data wrapper\ndbt\n\n\nPodcast Episode\n\nOpenTelemetry\nIceberg\n\n\nPodcast Episode\n\nParquet\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png)\r\n\r\nData projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at <u>[dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).</u>Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. 
Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.","date_published":"2023-10-29T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/921b519b-7ae6-43b6-8df0-39081bc9712d.mp3","mime_type":"audio/mpeg","size_in_bytes":33340819,"duration_in_seconds":2832}]},{"id":"c5dfb814-b9bd-4147-a54c-f5707f4b9814","title":"Defining A Strategy For Your Data Products","url":"https://www.dataengineeringpodcast.com/data-product-strategy-episode-397","content_text":"Summary\n\nThe primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nAs more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. 
Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYour host is Tobias Macey and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what is encompassed by the idea of a data product strategy?\n\n\nWhich roles in an organization need to be involved in the planning and implementation of that strategy?\n\norder of operations:\n\n\nstrategy -> platform design -> implementation/adoption\nplatform implementation -> product strategy -> interface development\n\nmanaging grain of data in products\nteam organization to support product development/deployment\ncustomer communications - what questions to ask? requirements gathering, helping to understand \"the art of the possible\"\nWhat are the most interesting, innovative, or unexpected ways that you have seen organizations approach data product strategies?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on defining and implementing data product strategies?\nWhen is a data product strategy overkill?\nWhat are some additional resources that you recommend for listeners to direct their thinking and learning about data product strategy?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nCXData Labs\nDimensional Modeling\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png)\r\n\r\nNODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. 
The event tracks:\r\n- Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript\r\n- Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)\r\n- Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)\r\n\r\nDon’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to <u>[Neo4j.com/NODES](https://Neo4j.com/NODES)</u> today to see the full agenda and register!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. 
Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products as the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.","date_published":"2023-10-22T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c5dfb814-b9bd-4147-a54c-f5707f4b9814.mp3","mime_type":"audio/mpeg","size_in_bytes":40915109,"duration_in_seconds":3830}]},{"id":"67fcadfc-45b4-4c62-aafe-852e20d37c72","title":"Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable","url":"https://www.dataengineeringpodcast.com/decodable-stream-processing-as-a-service-episode-396","content_text":"Summary\n\nBuilding streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. 
Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nAs more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.\nYour host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Decodable is and the story behind it?\n\n\nWhat are the notable changes to the Decodable platform since we last spoke? (October 2021)\nWhat are the industry shifts that have influenced the product direction?\n\nWhat are the problems that customers are trying to solve when they come to Decodable?\nWhen you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?\nWhat are the developer experience challenges that are particular to working with streaming data?\n\n\nHow have you worked to address that in the Decodable platform and interfaces?\n\nAs you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Decodable used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?\nWhen is Decodable the wrong choice?\nWhat do you have planned for the future of Decodable?\n\n\nContact Info\n\n\nesammer on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nDecodable\n\n\nPodcast Episode\n\nUnderstanding the Apache Flink Journey\nFlink\n\n\nPodcast Episode\n\nDebezium\n\n\nPodcast Episode\n\nKafka\nRedpanda\n\n\nPodcast Episode\n\nKinesis\nPostgreSQL\n\n\nPodcast Episode\n\nSnowflake\n\n\nPodcast Episode\n\nDatabricks\nStartree\nPinot\n\n\nPodcast Episode\n\nRockset\n\n\nPodcast Episode\n\nDruid\nInfluxDB\nSamza\nStorm\nPulsar\n\n\nPodcast Episode\n\nksqlDB\n\n\nPodcast Episode\n\ndbt\nGitHub Actions\nAirbyte\nSinger\nSplunk\nOutbox Pattern\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png)\r\n\r\nNODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:\r\n- Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript\r\n- Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)\r\n- Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)\r\n\r\nDon’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to <u>[Neo4j.com/NODES](https://Neo4j.com/NODES)</u> today to see the full agenda and register!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.","date_published":"2023-10-15T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/67fcadfc-45b4-4c62-aafe-852e20d37c72.mp3","mime_type":"audio/mpeg","size_in_bytes":42789353,"duration_in_seconds":4108}]},{"id":"0b3bf255-c4bc-4126-9ddf-016f5576127d","title":"Using Data To Illuminate The Intentionally Opaque Insurance Industry","url":"https://www.dataengineeringpodcast.com/coveragecat-insurance-industry-data-engineering-episode-395","content_text":"Summary\n\nThe insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nAs more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.\nYou shouldn't have to throw away the database to build with fast-changing data. 
You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nYour host is Tobias Macey and today I'm interviewing Max Cho about the wild world of insurance companies and the challenges of collecting quality data for this opaque industry\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what CoverageCat is and the story behind it?\nWhat are the different sources of data that you work with?\n\n\nWhat are the most challenging aspects of collecting that data?\nCan you describe the formats and characteristics (3 Vs) of that data?\n\nWhat are some of the ways that the operational model of insurance companies have contributed to its opacity as an industry from a data perspective?\nCan you describe how you have architected your data platform?\n\n\nHow have the design and goals changed since you first started working on it?\nWhat are you optimizing for in your selection and implementation process?\n\nWhat are the sharp edges/weak points that you worry about in your existing data flows?\n\n\nHow do you guard against those flaws in your day-to-day operations?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen your data sets used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on insurance industry data?\nWhen is a purely statistical view of insurance the wrong approach?\nWhat do you have planned for the future of CoverageCat's data stack?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nCoverageCat\nActuarial Model\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. 
Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png)\r\n\r\nNODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:\r\n- Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript\r\n- Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)\r\n- Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)\r\n\r\nDon’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to <u>[Neo4j.com/NODES](https://Neo4j.com/NODES)</u> today to see the full agenda and register!Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. 
Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business around making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.","date_published":"2023-10-08T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0b3bf255-c4bc-4126-9ddf-016f5576127d.mp3","mime_type":"audio/mpeg","size_in_bytes":32411719,"duration_in_seconds":3118}]},{"id":"78c06dfd-eaf0-4669-abc4-eb53be35bda9","title":"Building ETL Pipelines With Generative AI","url":"https://www.dataengineeringpodcast.com/building-etl-pipelines-with-generative-ai-episodde-394","content_text":"Summary\n\nArtificial intelligence applications require substantial high quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nAs more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? 
Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register at Neo4j.com/NODES.\nYour host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the different aspects/types of ETL that you are seeing generative AI applied to?\n\n\nWhat kind of impact are you seeing in terms of time spent/quality of output/etc.?\n\nWhat kinds of projects are most likely to benefit from the application of generative AI?\nCan you describe what a typical workflow of using AI to build ETL workflows looks like?\n\n\nWhat are some of the types of errors that you are likely to experience from the AI?\nOnce the pipeline is defined, what does the ongoing maintenance look like?\nIs the AI required to operate within the pipeline in perpetuity?\n\nFor individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address?\nWhat are the most interesting, innovative, or unexpected ways that you have seen generative AI used in ETL workflows?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on ETL and generative AI?\nWhen is AI the wrong choice for ETL applications?\nWhat are your predictions for future applications of AI in ETL and other data engineering practices?\n\n\nContact Info\n\n\nLinkedIn\n@MishraJay on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nAstera\nData Vault\nStar Schema\nOpenAI\nGPT == Generative Pre-trained Transformer\nEntity Resolution\nLLAMA\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png)\r\n\r\nNODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:\r\n- Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript\r\n- Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)\r\n- Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)\r\n\r\nDon’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to <u>[Neo4j.com/NODES](https://Neo4j.com/NODES)</u> today to see the full agenda and register!","content_html":"

Summary

Artificial intelligence applications require substantial volumes of high-quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.
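
As a rough illustration of the pattern discussed in this episode, the sketch below asks a language model to draft a SQL transformation from a plain-language description of a table. It assumes the OpenAI Python client, and the model name, schema, and prompt are hypothetical; this is not any particular vendor's workflow, just one way a generative model can propose ETL logic for an engineer to review.

```python
# Minimal sketch (assumes the OpenAI Python client >= 1.0; names are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = "orders(order_id INT, customer_id INT, amount NUMERIC, created_at TIMESTAMP)"
request = (
    f"Given the table {schema}, write a SQL query that produces daily revenue "
    "per customer, with columns customer_id, order_date, total_amount."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat-capable model would do
    messages=[
        {"role": "system", "content": "You generate SQL for ETL pipelines."},
        {"role": "user", "content": request},
    ],
)

generated_sql = response.choices[0].message.content
print(generated_sql)  # reviewed by an engineer before it is added to the pipeline
```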

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Artificial intelligence applications require substantial high quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.","date_published":"2023-10-01T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/78c06dfd-eaf0-4669-abc4-eb53be35bda9.mp3","mime_type":"audio/mpeg","size_in_bytes":25286334,"duration_in_seconds":3096}]},{"id":"0c3680ea-c45e-4a7f-a226-05d50628d6da","title":"Powering Vector Search With Real Time And Incremental Vector Indexes","url":"https://www.dataengineeringpodcast.com/real-time-vector-search-episode-393","content_text":"Summary\n\nThe rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nIf you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. 
Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!\nYour host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what vector search is and how it differs from other search technologies?\n\n\nWhat are the technical challenges related to providing vector search?\nWhat are the applications for vector search that merit the added complexity?\n\nVector databases have been gaining a lot of attention recently with the proliferation of LLM applications. Is a dedicated database technology required to support vector indexes/vector search queries?\n\n\nWhat are the use cases for native vector data types that are separate from AI?\n\nWith the increasing usage of vectors for data and AI/ML applications, who do you typically see as the owner of that problem space? (e.g. data engineers, ML engineers, data scientists, etc.)\nFor teams who are investing in vector search, what are the architectural considerations that they need to be aware of?\n\n\nHow does it impact the data pipeline strategies/topologies used?\n\nWhat are the complexities that need to be addressed when updating vector data in a real-time/streaming fashion?\n\n\nHow does that influence the client strategies that are querying that data?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen vector search used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on vector search applications?\nWhen is vector search the wrong choice?\nWhat do you see as future potential applications for vector indexes/vector search?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. The Machine Learning Podcast helps you go from idea to production with machine learning. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nRockset\n\n\nPodcast Episode\n\nVector Index\nVector Search\n\n\nRockset Implementation Explanation\n\nVector Space\nEuclidean Distance\nOLAP == Online Analytical Processing\nOLTP == Online Transaction Processing\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png)\r\n\r\nHex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. 
If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
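
To make the core operation concrete, here is a minimal sketch of brute-force vector search using cosine similarity with NumPy. The document vectors and query embedding are made up for illustration; production systems such as the ones discussed in the episode replace the full scan with incrementally updated approximate-nearest-neighbor indexes.

```python
import numpy as np

# Tiny in-memory "index": one embedding vector per document. In practice these
# come from an embedding model and must be kept up to date as documents change.
doc_ids = ["doc-a", "doc-b", "doc-c"]
doc_vectors = np.array([
    [0.1, 0.3, 0.9],
    [0.8, 0.1, 0.2],
    [0.2, 0.9, 0.4],
])

def cosine_top_k(query, vectors, k=2):
    """Brute-force nearest neighbors by cosine similarity."""
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return [(doc_ids[i], float(sims[i])) for i in top]

query_embedding = np.array([0.15, 0.25, 0.85])  # hypothetical query vector
print(cosine_top_k(query_embedding, doc_vectors))
```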

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.","date_published":"2023-09-24T20:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0c3680ea-c45e-4a7f-a226-05d50628d6da.mp3","mime_type":"audio/mpeg","size_in_bytes":35356149,"duration_in_seconds":3556}]},{"id":"dac0cd8b-c50c-4823-9838-51be49eaa8d0","title":"Building Linked Data Products With JSON-LD","url":"https://www.dataengineeringpodcast.com/linked-data-products-json-ld-episode-392","content_text":"Summary\n\nA significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nIf you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. 
Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!\nYour host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the term \"linked data product\" means and some examples of when you might build one?\n\n\nWhat is the overlap between knowledge graphs and \"linked data products\"?\n\nWhat is JSON-LD?\n\n\nWhat are the domains in which it is typically used?\nHow does it assist in developing linked data products?\n\nwhat are the characteristics that distinguish a knowledge graph from \nWhat are the layers/stages of applications and data that can/should incorporate JSON-LD as the representation for records and events?\n\n\nWhat is the level of native support/compatibiliity that you see for JSON-LD in data systems?\n\nWhat are the modeling exercises that are necessary to ensure useful and appropriate linkages of different records within and between products and organizations?\nCan you describe the workflow for building autonomous linkages across data assets that are modelled as JSON-LD?\nWhat are the most interesting, innovative, or unexpected ways that you have seen JSON-LD used for data workflows?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on linked data products?\nWhen is JSON-LD the wrong choice?\nWhat are the future directions that you would like to see for JSON-LD and linked data in the data ecosystem?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nFluree\nJSON-LD\nKnowledge Graph\nAdjacency List\nRDF == Resource Description Framework\nSemantic Web\nOpen Graph\nSchema.org\nRDF Triple\nIDMP == Identification of Medicinal Products\nFIBO == Financial Industry Business Ontology\nOWL Standard\nNP-Hard\nForward-Chaining Rules\nSHACL == Shapes Constraint Language)\nZero Knowledge Cryptography\nTurtle Serialization\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png)\r\n\r\nHex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. 
Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
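
For readers who have not seen the format, below is a small, hypothetical JSON-LD record written as a Python dictionary. The @context makes the plain keys resolve to Schema.org terms, and the @id values give each entity a globally unique identifier so records can be linked across products and organizations.

```python
import json

# A hypothetical product record expressed as JSON-LD. The @context ties the
# plain keys ("name", "manufacturer") to Schema.org, and @id gives each entity
# an IRI so separate systems can refer to the same thing.
product = {
    "@context": "https://schema.org",
    "@id": "https://example.com/products/42",
    "@type": "Product",
    "name": "Example Widget",
    "manufacturer": {
        "@id": "https://example.com/orgs/acme",
        "@type": "Organization",
        "name": "Acme Corp",
    },
}

print(json.dumps(product, indent=2))
```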

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.","date_published":"2023-09-17T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/dac0cd8b-c50c-4823-9838-51be49eaa8d0.mp3","mime_type":"audio/mpeg","size_in_bytes":35619311,"duration_in_seconds":3690}]},{"id":"a1a42d3a-f643-4dfc-a3ec-7da1e01927e9","title":"An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem ","url":"https://www.dataengineeringpodcast.com/state-of-data-orchestration-episode-391","content_text":"Summary\n\nData systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. 
Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nYour host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.)\nWhat are the misconceptions about the applications of/need for/cost to implement data orchestration?\n\n\nHow do those challenges of customer education change across roles/personas?\n\nBecause of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine?\nYou have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time?\nOne of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. What are the factors that you have seen make the most influence in driving adoption of a given engine?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration?\nWhen is a data orchestrator the wrong choice?\nWhat do you have planned for the future of orchestration with Dagster?\n\n\nContact Info\n\n\n@schrockn on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nDagster\nGraphQL\nK8s == Kubernetes\nAirbyte\n\n\nPodcast Episode\n\nHightouch\n\n\nPodcast Episode\n\nAirflow\nPrefect\nFlyte\n\n\nPodcast Episode\n\ndbt\n\n\nPodcast Episode\n\nDAG == Directed Acyclic Graph\nTemporal\nSoftware Defined Assets\nDataForm\nGradient Flow State Of Orchestration Report 2022\nMLOps Is 98% Data Engineering\nDataHub\n\n\nPodcast Episode\n\nOpenMetadata\n\n\nPodcast Episode\n\nAtlan\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. 
Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations, offering a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application, to help inform how you implement it in your own environment.
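
As a toy example of the software-defined-asset style of orchestration discussed here, the sketch below declares two dependent assets with Dagster and materializes them in process. The asset names and logic are invented, and a real deployment would run this through a schedule or sensor rather than a direct materialize call.

```python
# Minimal sketch using Dagster's asset API (assumes `pip install dagster`).
from dagster import asset, materialize

@asset
def raw_orders():
    # Stand-in for an extraction step, e.g. pulling rows from an API or database.
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 35.5}]

@asset
def order_summary(raw_orders):
    # Downstream asset: Dagster infers the dependency from the argument name.
    return {"count": len(raw_orders), "total": sum(o["amount"] for o in raw_orders)}

if __name__ == "__main__":
    # Materialize both assets in dependency order and report the outcome.
    result = materialize([raw_orders, order_summary])
    print("success:", result.success)
```

Because the dependencies are declared on the assets themselves, the orchestrator can surface lineage and re-materialize only what is stale.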

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.","date_published":"2023-09-10T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a1a42d3a-f643-4dfc-a3ec-7da1e01927e9.mp3","mime_type":"audio/mpeg","size_in_bytes":38079193,"duration_in_seconds":3685}]},{"id":"299b15cf-cebf-40ba-aad1-c1eac9becc54","title":"Eliminate The Overhead In Your Data Integration With The Open Source dlt Library","url":"https://www.dataengineeringpodcast.com/dlt-data-integration-library-episode-390","content_text":"Summary\n\nCloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. 
Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYour host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what dlt is and the story behind it?\n\n\nWhat is the problem you want to solve with dlt?\nWho is the target audience?\n\nThe obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt?\nCan you describe how dlt is implemented?\nWhat are the benefits of building it in Python?\nHow have the design and goals of the project changed since you first started working on it?\nHow does that language choice influence the performance and scaling characteristics?\nWhat problems do users solve with dlt?\nWhat are the interfaces available for extending/customizing/integrating with dlt?\nCan you talk through the process of adding a new source/destination?\nWhat is the workflow for someone building a pipeline with dlt?\nHow does the experience scale when supporting multiple connections?\nGiven the limited scope of extract and load, and the composable design of dlt it seems like a purpose built companion to dbt (down to the naming). What are the benefits of using those tools in combination?\nWhat are the most interesting, innovative, or unexpected ways that you have seen dlt used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?\nWhen is dlt the wrong choice?\nWhat do you have planned for the future of dlt?\n\n\nContact Info\n\n\nLinkedIn\nJoin our community to discuss further\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\ndlt\n\n\nHarness Success Story\n\nOur guiding product principles\nEcosystem support\nFrom basic to complex, dlt has many capabilities\nSinger\nAirbyte\n\n\nPodcast Episode\n\nMeltano\n\n\nPodcast Episode\n\nMatillion\n\n\nPodcast Episode\n\nFivetran\n\n\nPodcast Episode\n\nDuckDB\n\n\nPodcast Episode\n\nOpenAPI\nData Mesh\n\n\nPodcast Episode\n\nSQLMesh\n\n\nPodcast Episode\n\nAirflow\nDagster\n\n\nPodcast Episode\n\nPrefect\n\n\nPodcast Episode\n\nAlto\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate that overhead and bring data integration under your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
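
As an illustration of the library-style workflow described here, this is a minimal sketch of a dlt pipeline loading a small in-memory dataset into DuckDB. The pipeline, dataset, and resource names are hypothetical, and the exact options can vary between dlt releases, so treat it as the shape of the workflow rather than canonical usage.

```python
# Minimal sketch (assumes `pip install "dlt[duckdb]"`); names are illustrative.
import dlt

@dlt.resource(name="users", write_disposition="append")
def users():
    # Stand-in for an extraction step such as paging through a REST API.
    yield [
        {"id": 1, "name": "Ada", "plan": "pro"},
        {"id": 2, "name": "Grace", "plan": "free"},
    ]

pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",
    destination="duckdb",
    dataset_name="raw",
)

# Extract, normalize, and load in one call; dlt infers and evolves the schema.
load_info = pipeline.run(users())
print(load_info)
```

The same extraction code can be pointed at a different destination (for example a cloud warehouse) by changing the pipeline's destination setting rather than rewriting the source.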

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.","date_published":"2023-09-03T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/299b15cf-cebf-40ba-aad1-c1eac9becc54.mp3","mime_type":"audio/mpeg","size_in_bytes":29098009,"duration_in_seconds":2532}]},{"id":"6bf276bf-9768-4b08-bda6-98770c9d1e05","title":"Building An Internal Database As A Service Platform At Cloudflare","url":"https://www.dataengineeringpodcast.com/cloudflare-postgres-database-as-a-service-episode-389","content_text":"Summary\n\nData persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. 
Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nYour host is Tobias Macey and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the different database workloads that you have at Cloudflare?\n\n\nWhat are the different methods that you have used for managing database instances?\n\nWhat are the requirements and constraints that you had to account for in designing your current system?\nWhy Postgres?\noptimizations for Postgres\n\n\nsimplification from not supporting multiple engines\n\nlimitations in postgres that make multi-tenancy challenging\nscale of operation (data volume, request rate\nWhat are the most interesting, innovative, or unexpected ways that you have seen your DBaaS used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on your internal database platform?\nWhen is an internal database as a service the wrong choice?\nWhat do you have planned for the future of Postgres hosting at Cloudflare?\n\n\nContact Info\n\n\nLinkedIn\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nCloudflare\nPostgreSQL\n\n\nPodcast Episode\n\nIP Address Data Type in Postgres\nCockroachDB\n\n\nPodcast Episode\n\nCitus\n\n\nPodcast Episode\n\nYugabyte\n\n\nPodcast Episode\n\nStolon\npg_rewind\nPGBouncer\nHAProxy Presentation\nEtcd\nPatroni\npg_upgrade\nEdge Computing\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. 
Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!","content_html":"

Summary

\n\n

Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud, most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low-latency and high-uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.","date_published":"2023-08-27T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6bf276bf-9768-4b08-bda6-98770c9d1e05.mp3","mime_type":"audio/mpeg","size_in_bytes":40555415,"duration_in_seconds":3669}]},{"id":"e41b131b-ae14-478b-adf6-b2e108406a50","title":"Harnessing Generative AI For Creating Educational Content With Illumidesk","url":"https://www.dataengineeringpodcast.com/illumidesk-generative-ai-educational-platform-episode-388","content_text":"Summary\n\nGenerative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYou shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. 
Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!\nYour host is Tobias Macey and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI powered online learning platform\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Illumidesk is and the story behind it?\nWhat are the challenges that educators and content creators face in developing and maintaining digital course materials for their target audiences?\nHow are you leaning on data integrations and AI to reduce the initial time investment required to deliver courseware?\nWhat are the opportunities for collecting and collating learner interactions with the course materials to provide feedback to the instructors?\nWhat are some of the ways that you are incorporating pedagogical strategies into the measurement and evaluation methods that you use for reports?\nWhat are the different categories of insights that you need to provide across the different stakeholders/personas who are interacting with the platform and learning content?\nCan you describe how you have architected the Illumidesk platform?\nHow have the design and goals shifted since you first began working on it?\nWhat are the strategies that you have used to allow for evolution and adaptation of the system in order to keep pace with the ecosystem of generative AI capabilities?\nWhat are the failure modes of the content generation that you need to account for?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Illumidesk used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Illumidesk?\nWhen is Illumidesk the wrong choice?\nWhat do you have planned for the future of Illumidesk?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nIllumidesk\nGenerative AI\nVector Database\nLTI == Learning Tools Interoperability\nSCORM\nXAPI\nPrompt Engineering\nGPT-4\nLLama\nAnthropic\nFastAPI\nLangChain\nCelery\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nYou shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.\r\n\r\nThat is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. 
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u> today and get 2 weeks free!Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data-driven experience for learners.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.","date_published":"2023-08-20T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e41b131b-ae14-478b-adf6-b2e108406a50.mp3","mime_type":"audio/mpeg","size_in_bytes":28032231,"duration_in_seconds":3292}]},{"id":"4a689109-ca28-4585-8d89-dcdd65d02abe","title":"Unpacking The Seven Principles Of Modern Data Pipelines","url":"https://www.dataengineeringpodcast.com/rivery-seven-principles-modern-data-pipelines-episode-387","content_text":"Summary\n\nData pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. 
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold\nYour host is Tobias Macey and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by defining what you mean by a \"modern\" data pipeline?\nAt Rivery you published a white paper identifying seven principles of modern data pipelines:\n\n\nZero infrastructure management\nELT-first mindset\nSpeaks SQL and Python\nDynamic multi-storage layers\nReverse ETL & operational analytics\nFull transparency\nFaster time to value\n\nWhat are the applications of data that you focused on while identifying these principles?\nHow do the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business?\nWhat are the technical components of a pipeline infrastructure that are necessary to support a \"modern\" workflow?\nHow do the technologies involved impact the organizational involvement with how data is applied throughout the business?\nWhen using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data?\nWhat are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles?\nWhat are the cases where some/all of these principles are undesirable/impractical to implement?\nWhat are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nRivery\n7 Principles Of The Modern Data Pipeline\nELT\nReverse ETL\nMartech Landscape\nData Lakehouse\nDatabricks\nSnowflake\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nThis episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. 
If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful, you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.","date_published":"2023-08-13T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4a689109-ca28-4585-8d89-dcdd65d02abe.mp3","mime_type":"audio/mpeg","size_in_bytes":29521464,"duration_in_seconds":2822}]},{"id":"40a518f8-4d2e-407e-ac7a-27dd92cbca25","title":"Quantifying The Return On Investment For Your Data Team","url":"https://www.dataengineeringpodcast.com/data-team-roi-episode-386","content_text":"Summary\n\nAs businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Barr Moses and Anna Filippova about how and whether to measure the ROI of your data team\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the typical motivations for measuring and tracking the ROI for a data team?\n\n\nWho is responsible for collecting that information?\nHow is that information used and by whom?\n\nWhat are some of the downsides/risks of tracking this metric? (law of unintended consequences)\nWhat are the inputs to the number that constitutes the \"investment\"? infrastructure, payroll of employees on team, time spent working with other teams?\nWhat are the aspects of data work and its impact on the business that complicate a calculation of the \"return\" that is generated?\nHow should teams think about measuring data team ROI? \nWhat are some concrete ROI metrics data teams can use?\n\n\nWhat level of detail is useful? What dimensions should be used for segmenting the calculations?\n\nHow can visibility into this ROI metric be best used to inform the priorities and project scopes of the team? \nWith so many tools in the modern data stack today, what is the role of technology in helping drive or measure this impact? \nHow do your respective solutions, Monte Carlo and dbt, help teams measure and scale data value? \nWith generative AI on the upswing of the hype cycle, what are the impacts that you see it having on data teams?\n\n\nWhat are the unrealistic expectations that it will produce?\nHow can it speed up time to delivery? 
\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen data team ROI calculated and/or used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on measuring the ROI of data teams?\nWhen is measuring ROI the wrong choice?\n\n\nContact Info\n\n\nBarr\n\n\nLinkedIn\n\nAnna\n\n\nLinkedIn\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nMonte Carlo\n\n\nPodcast Episode\n\ndbt\n\n\nPodcast Episode\n\nJetBlue Snowflake Con Presentation\nGenerative AI\nLarge Language Models\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.","date_published":"2023-08-06T19:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/40a518f8-4d2e-407e-ac7a-27dd92cbca25.mp3","mime_type":"audio/mpeg","size_in_bytes":37574457,"duration_in_seconds":3712}]},{"id":"809b5635-c795-456f-9fab-0510dafc1a45","title":"Strategies For A Successful Data Platform Migration","url":"https://www.dataengineeringpodcast.com/data-platform-migrations-episode-385","content_text":"Summary\n\nAll software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nModern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!\nYour host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nA migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?\nIs it possible to completely avoid having to invest in a migration?\nWhat are the signals that point to the need for a migration?\n\n\nWhat are some of the sources of cost that need to be accounted for when considering a migration? 
(both in terms of doing one, and the costs of not doing one)\nWhat are some signals that a migration is not the right solution for a perceived problem?\n\nOnce the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?\nWhat are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?\nWhat are some of the ways that a migration effort might fail?\nWhat are the major pitfalls that teams need to be aware of as they work through a data platform migration?\nWhat are the opportunities for automation during the migration process?\nWhat are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?\nWhat are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?\n\n\nContact Info\n\n\nGleb\n\n\nLinkedIn\n@glebmm on Twitter\n\nRob\n\n\nLinkedIn\nRobGoretsky on GitHub\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nDatafold\n\n\nPodcast Episode\n\nInformatica\nAirflow\nSnowflake\n\n\nPodcast Episode\n\nRedshift\nEventbrite\nTeradata\nBigQuery\nTrino\nEMR == Elastic Map-Reduce\nShadow IT\n\n\nPodcast Episode\n\nMode Analytics\nLooker\nSunk Cost Fallacy\ndata-diff\n\n\nPodcast Episode\n\nSQLGlot\n[Dagster](https://dagster.io/)\ndbt\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png)\r\n\r\nHex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex) and get 30 days free!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. 
Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, so an eventual migration is inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations and the hard-won lessons they learned so that you don't have to.
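A recurring theme in platform migrations is validating that the new system produces the same data as the old one. As a hedged sketch only (the connection URIs, table names, and key columns below are placeholders, and this is just one way to approach validation), the open source data-diff tool listed in the episode links can compare the same table across two databases:

```python
# Hypothetical parity check between an old and a new warehouse during a migration.
# Assumes `pip install data-diff` plus the relevant database driver extras;
# all URIs, table names, and key columns are placeholders.
from data_diff import connect_to_table, diff_tables

old_orders = connect_to_table("postgresql://user:pass@legacy-db/app", "orders", "id")
new_orders = connect_to_table("snowflake://user:pass@account/db/schema", "ORDERS", "ID")

# diff_tables yields ('+', row) for rows present only in the new table and
# ('-', row) for rows present only in the old one; an empty result suggests parity.
mismatches = list(diff_tables(old_orders, new_orders))
print(f"{len(mismatches)} differing rows")
```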

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.","date_published":"2023-07-30T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/809b5635-c795-456f-9fab-0510dafc1a45.mp3","mime_type":"audio/mpeg","size_in_bytes":43028033,"duration_in_seconds":4192}]},{"id":"72597788-ff1a-48d9-a875-affc037b26af","title":"Build Real Time Applications With Operational Simplicity Using Dozer","url":"https://www.dataengineeringpodcast.com/dozer-real-time-application-framework-episode-384","content_text":"Summary\n\nReal-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data in reach of application engineers Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance and operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nModern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!\nYour host is Tobias Macey and today I'm interviewing Matteo Pelati about Dozer, an open source engine that includes data ingestion, transformation, and API generation for real-time sources\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Dozer is and the story behind it?\n\n\nWhat was your decision process for building Dozer as open source?\n\nAs you note in the documentation, Dozer has overlap with a number of technologies that are aimed at different use cases. 
What was missing from each of them and the center of their Venn diagram that prompted you to build Dozer?\nIn addition to working in an interesting technological cross-section, you are also targeting a disparate group of personas. Who are you building Dozer for and what were the motivations for that vision?\n\n\nWhat are the different use cases that you are focused on supporting?\nWhat are the features of Dozer that enable engineers to address those uses, and what makes it preferable to existing alternative approaches?\n\nCan you describe how Dozer is implemented?\n\n\nHow have the design and goals of the platform changed since you first started working on it?\nWhat are the architectural \"-ilities\" that you are trying to optimize for?\n\nWhat is involved in getting Dozer deployed and integrated into an existing application/data infrastructure?\nHow can teams who are using Dozer extend/integrate with Dozer?\n\n\nWhat does the development/deployment workflow look like for teams who are building on top of Dozer?\n\nWhat is your governance model for Dozer and balancing the open source project against your business goals?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Dozer used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Dozer?\nWhen is Dozer the wrong choice?\nWhat do you have planned for the future of Dozer?\n\n\nContact Info\n\n\nLinkedIn\n@pelatimtt on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nDozer\nData Robot\nNetflix Bulldozer\nCubeJS\n\n\nPodcast Episode\n\nJVM == Java Virtual Machine\nFlink\n\n\nPodcast Episode\n\nAirbyte\n\n\nPodcast Episode\n\nFivetran\n\n\nPodcast Episode\n\nDelta Lake\n\n\nPodcast Episode\n\nLMDB\nVector Database\nLLM == Large Language Model\nRockset\n\n\nPodcast Episode\n\nTinybird\n\n\nPodcast Episode\n\nRust Language\nMaterialize\n\n\nPodcast Episode\n\nRisingWave\nDuckDB\n\n\nPodcast Episode\n\nDataFusion\nPolars\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png)\r\n\r\nHex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. 
Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex) and get 30 days free!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data within reach of application engineers, Matteo Pelati helped to create Dozer. In this episode he explains how investing in high-performance, operationally simplified streaming with a familiar API can yield significant benefits for software and data teams alike.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data in reach of application engineers Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance and operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.","date_published":"2023-07-23T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/72597788-ff1a-48d9-a875-affc037b26af.mp3","mime_type":"audio/mpeg","size_in_bytes":26969592,"duration_in_seconds":2442}]},{"id":"c3377ba1-2eeb-4f9d-8f61-7bae3184a4fb","title":"Datapreneurs - How Todays Business Leaders Are Using Data To Define The Future","url":"https://www.dataengineeringpodcast.com/datapreneurs-book-bob-muglia-episode-383","content_text":"Summary\n\nData has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book \"Datapreneurs\" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Bob Muglia about his recent book about the idea of \"Datapreneurs\" and the role of data in the modern economy\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your concept of a \"Datapreneur\" is?\n\n\nHow is this distinct from the common idea of an entreprenur?\n\nWhat do you see as the key inflection points in data technologies and their impacts on business capabilities over the past ~30 years?\nIn your role as the CEO of Snowflake you had a first-row seat for the rise of the \"modern data stack\". 
What do you see as the main positive and negative impacts of that paradigm?\n\n\nWhat are the key issues that are yet to be solved in that ecosystem?\n\nFor technologists who are thinking about launching new ventures, what are the key pieces of advice that you would like to share?\nWhat do you see as the short/medium/long-term impact of AI on the technical, business, and societal arenas?\nWhat are the most interesting, innovative, or unexpected ways that you have seen business leaders use data to drive their vision?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the Datapreneurs book?\nWhat are your key predictions for the future impact of data on the technical/economic/business landscapes?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nDatapreneurs Book\nSQL Server\nSnowflake\nZ80 Processor\nNavigational Database\nSystem R\nRedshift\nMicrosoft Fabric\nDatabricks\nLooker\nFivetran\n\n\nPodcast Episode\n\nDatabricks Unity Catalog\nRelationalAI\n6th Normal Form\nPinecone Vector DB\n\n\nPodcast Episode\n\nPerplexity AI\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book \"Datapreneurs\" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.","date_published":"2023-07-16T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c3377ba1-2eeb-4f9d-8f61-7bae3184a4fb.mp3","mime_type":"audio/mpeg","size_in_bytes":27347492,"duration_in_seconds":3285}]},{"id":"0b63b09b-939f-4ce7-b8e6-50a0e18f7726","title":"Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling","url":"https://www.dataengineeringpodcast.com/entity-centric-data-modeling-episode-382","content_text":"Summary\n\nFor business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what entity-centric modeling (ECM) is and the story behind it?\n\n\nHow does it compare to dimensional modeling strategies?\nWhat are some of the other competing methods\nComparison to activity schema\n\nWhat impact does this have on ML teams? (e.g. feature engineering)\nWhat role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. informatica vs. 
ETL scripts, etc.)\n\n\nWhat is the impact on the underlying compute engine on the modeling strategies used?\n\nWhat are some examples of data sources or problem domains for which this approach is well suited?\n\n\nWhat are some cases where entity centric modeling techniques might be counterproductive?\n\nWhat are the ways that the benefits of ECM manifest in use cases that are down-stream from the warehouse?\nWhat are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?\n\n\nHow does this work across business domains within a given organization (especially at \"enterprise\" scale)?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen ECM used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?\nWhen is ECM the wrong choice?\nWhat are your predictions for the future direction/adoption of ECM or other modeling techniques?\n\n\nContact Info\n\n\nmistercrunch on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nEntity Centric Modeling Blog Post\nMax's Previous Apperances\n\n\nDefining Data Engineering with Maxime Beauchemin\nSelf Service Data Exploration And Dashboarding With Superset\nExploring The Evolving Role Of Data Engineers\nAlumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations\n\nApache Airflow\nApache Superset\nPreset\nUbisoft\nRalph Kimball\nThe Rise Of The Data Engineer\nThe Downfall Of The Data Engineer\nThe Rise Of The Data Scientist\nDimensional Data Modeling\nStar Schema\nDatabase Normalization\nFeature Engineering\nDRY == Don't Repeat Yourself\nActivity Schema\n\n\nPodcast Episode\n\nCorporate Information Factory (affiliate link)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.","date_published":"2023-07-09T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0b63b09b-939f-4ce7-b8e6-50a0e18f7726.mp3","mime_type":"audio/mpeg","size_in_bytes":37871875,"duration_in_seconds":4374}]},{"id":"0100bf55-fe87-42b2-9f08-d1930bf0c612","title":"How Data Engineering Teams Power Machine Learning With Feature Platforms","url":"https://www.dataengineeringpodcast.com/data-engineering-ml-feature-platforms-episode-381","content_text":"Summary\n\nFeature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is feature engineering is and why/to whom it matters?\n\n\nA topic that commonly comes up in relation to feature engineering is the importance of a feature store. 
What are the tradeoffs for that to be a separate infrastructure/architecture component?\n\nWhat is the overall lifecycle of a feature, from definition to deployment and maintenance?\n\n\nHow is this distinct from other forms of data pipeline development and delivery?\nWho are the participants in that workflow?\n\nWhat are the sharp edges/roadblocks that typically manifest in that lifecycle?\nWhat are the interfaces that are needed for data scientists/ML engineers to be able to self-serve their feature management?\n\n\nWhat is the role of the data engineer in supporting those interfaces?\nWhat are the communication/collaboration channels that are necessary to make the overall process a success?\n\nFrom an implementation/architecture perspective, what are the patterns that you have seen teams build around for feature development/serving?\nWhat are the most interesting, innovative, or unexpected ways that you have seen feature platforms used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on feature engineering?\nWhat are the resources that you find most helpful in understanding and designing feature platforms?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nFeatureByte\nDataRobot\nFeature Store\nFeast Feature Store\nFeathr\nKaggle\nYann LeCun\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nIntroducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)","content_html":"

Summary

\n\n

Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.","date_published":"2023-07-03T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0100bf55-fe87-42b2-9f08-d1930bf0c612.mp3","mime_type":"audio/mpeg","size_in_bytes":30870019,"duration_in_seconds":3809}]},{"id":"35496260-a618-4177-8a6f-4c1c797d042e","title":"Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh","url":"https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380","content_text":"Summary\n\nData transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack-\nYour host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what SQLMesh is and the story behind it?\n\n\nDataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?\n\nWhat are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?\n\n\nHow do those rough edges impact the productivity and effectiveness of teams using those\n\nCan you describe how SQLMesh is implemented?\n\n\nHow have the design and goals evolved since you first started working on it?\n\nWhat are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?\nFor teams who have already invested in dbt, what is the migration path from or integration with dbt?\nYou have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?\nWhat do you see as the potential benefits of integration with e.g. 
data-diff?\nWhat are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workfows and the associated dependency chains?\nWhat are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?\nWhen is SQLMesh the wrong choice?\nWhat do you have planned for the future of SQLMesh?\n\n\nContact Info\n\n\ntobymao on GitHub\n@captaintobs on Twitter\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nSQLMesh\nTobiko Data\nSAS\nAirBnB Minerva\nSQLGlot\nCron\nAST == Abstract Syntax Tree\nPandas\nTerraform\ndbt\n\n\nPodcast Episode\n\nSQLFluff\n\n\nPodcast.__init__ Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.","date_published":"2023-06-25T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/35496260-a618-4177-8a6f-4c1c797d042e.mp3","mime_type":"audio/mpeg","size_in_bytes":34305723,"duration_in_seconds":3019}]},{"id":"e0fdf3eb-a5de-4dfc-a437-bdd10dffc17c","title":"How Column-Aware Development Tooling Yields Better Data Models","url":"https://www.dataengineeringpodcast.com/coalesce-column-aware-data-architecture-episode-379","content_text":"Summary\n\nArchitectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. By incorporating column-level lineage in the data modeling process it encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack-\nYour host is Tobias Macey and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nHow has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?\n\n\nThere are ongoing conversations about the continued merits of dimensional modeling techniques in modern warehouses. What are the modeling practices that you have found to be most useful in large and complex data environments?\n\nCan you describe what you mean by the term column-aware in the context of data modeling/data architecture?\n\n\nWhat are the capabilities that need to be built into a tool for it to be effectively column-aware?\n\nWhat are some of the ways that tools like dbt miss the mark in managing large/complex transformation workloads?\nColumn-awareness is obviously critical in the context of the warehouse. What are some of the ways that that information can be fed into other contexts? (e.g. ML, reverse ETL, etc.)\nWhat is the importance of embedding column-level lineage awareness into transformation tool vs. 
layering on top w/ dedicated lineage/metadata tooling?\nWhat are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on building column-aware tooling?\nWhen is column-aware modeling the wrong choice?\nWhat are some additional resources that you recommend for individuals/teams who want to learn more about data modeling/column aware principles?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nCoalesce\n\n\nPodcast Episode\n\nStar Schema\nConformed Dimensions\nData Vault\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage into the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. By incorporating column-level lineage in the data modeling process it encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.","date_published":"2023-06-17T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e0fdf3eb-a5de-4dfc-a437-bdd10dffc17c.mp3","mime_type":"audio/mpeg","size_in_bytes":29078302,"duration_in_seconds":2779}]},{"id":"a34de226-38c9-48e9-99f5-863b1d3b8a2f","title":"Build Better Tests For Your dbt Projects With Datafold And data-diff","url":"https://www.dataengineeringpodcast.com/datafold-dbt-testing-episode-378","content_text":"Summary\n\nData engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Datafold is and what's new since we last spoke? (July 2021 and July 2022 about data-diff)\nWhat are the roadblocks to data testing/validation that you see teams run into most often?\n\n\nHow does the tooling used contribute to/help address those roadblocks?\n\nWhat are some of the error conditions/failure modes that data-diff can help identify in a dbt project?\n\n\nWhat are some examples of tests that need to be implemented by the engineer?\n\nIn your experience working with data teams, what typically constitutes the \"staging area\" for a dbt project? (e.g. separate warehouse, namespaced tables, snowflake data copies, lakefs, etc.)\nGiven a dbt project that is well tested and has data-diff as part of the validation suite, what are the challenges that teams face in managing the feedback cycle of running those tests?\nIn application development there is the idea of the \"testing pyramid\", consisting of unit tests, integration tests, system tests, etc. 
What are the parallels to that in data projects?\n\n\nWhat are the limitations of the data ecosystem that make testing a bigger challenge than it might otherwise be?\n\nBeyond test execution, what are the other aspects of data health that need to be included in the development and deployment workflow of dbt projects? (e.g. freshness, time to delivery, etc.)\nWhat are the most interesting, innovative, or unexpected ways that you have seen Datafold and/or data-diff used for testing dbt projects?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt testing internally or with your customers?\nWhen is Datafold/data-diff the wrong choice for dbt projects?\nWhat do you have planned for the future of Datafold?\n\n\nContact Info\n\n\nLinkedIn\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDatafold\n\n\nPodcast Episode\n\ndata-diff\n\n\nPodcast Episode\n\ndbt\nDagster\ndbt-cloud slim CI\nGitHub Actions\nJenkins\nCircle CI\nDolt\nMalloy\nLakeFS\nPlanetscale\nSnowflake Zero Copy Cloning\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASpecial Guest: Gleb Mezhanskiy.Sponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Closing Announcements

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Special Guest: Gleb Mezhanskiy.

Sponsored By:

","summary":"Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.","date_published":"2023-06-11T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a34de226-38c9-48e9-99f5-863b1d3b8a2f.mp3","mime_type":"audio/mpeg","size_in_bytes":25301315,"duration_in_seconds":2901}]},{"id":"eeb361fc-b683-4173-b6d4-6a8cc640d342","title":"Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service","url":"https://www.dataengineeringpodcast.com/agile-data-engine-dataops-platform-episode-377","content_text":"Summary\n\nA significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. 
Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Agile Data Engine is and the story behind it?\nWhat are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?\n\n\nHow does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data?\nWhat are some of the types of experiments that are enabled by reduced operational overhead?\n\nWhat does CI/CD look like for a data warehouse?\n\n\nHow is it different from CI/CD for software applications?\n\nCan you describe how Agile Data Engine is architected?\n\n\nHow have the design and goals of the system changed since you first started working on it?\nWhat are the components that you needed to develop in-house to enable your platform goals?\n\nWhat are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption?\nCan you describe the workflow for a team that is using Agile Data Engine to power their business analytics?\n\n\nWhat are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?\n\nIn your \"about\" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry?\nHow have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine?\nWhen is Agile Data Engine the wrong choice?\nWhat do you have planned for the future of Agile Data Engine?\n\n\nGuest Contact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nAbout Agile Data Engine\n\nAgile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world.\nAgile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.\n\nLinks\n\n\nAgile Data Engine\nBill Inmon\nRalph Kimball\nSnowflake\nRedshift\nBigQuery\nAzure Synapse\nAirflow\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. 
You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Guest Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

About Agile Data Engine

\n\n

Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world.
\nAgile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.

\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.","date_published":"2023-06-04T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/eeb361fc-b683-4173-b6d4-6a8cc640d342.mp3","mime_type":"audio/mpeg","size_in_bytes":36404525,"duration_in_seconds":3245}]},{"id":"77a70af0-6fc2-4bb0-93cc-a95a9b60cabf","title":"A Roadmap To Bootstrapping The Data Team At Your Startup","url":"https://www.dataengineeringpodcast.com/ghalib-suleiman-startup-data-teams-episode-376","content_text":"Summary\n\nBuilding a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. 
Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Ghalib Suleiman about challenges and strategies for building data teams in a startup\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your conception of the responsibilities of a data team?\nWhat are some of the common fallacies that organizations fall prey to in their first efforts at building data capabilities?\n\n\nHave you found it more practical to hire outside talent to build out the first data systems, or grow that talent internally?\nWhat are some of the resources you have found most helpful in training/educating the early creators and consumers of data assets?\n\nWhen there is no internal data talent to assist with hiring, what are some of the problems that manifest in the hiring process?\n\n\nWhat are the concepts that the new hire needs to know?\nHow much does the hiring manager/interviewer need to know about those concepts to evaluate skill?\n\nWhat are the most critical skills for a first hire to have to start generating valuable output?\nAs a solo data person, what are the uphill battles that they need to be prepared for in the organization?\n\n\nWhat are the rabbit holes that they should beware of?\n\nWhat are some of the tactical \nWhat are the most interesting, innovative, or unexpected ways that you have seen initial data hires tackle startup challenges?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on starting and growing data teams?\nWhen is it more practical to outsource the data work?\n\n\nContact Info\n\n\nLinkedIn\n@ghalib on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nPolytomic\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. 
You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.","date_published":"2023-05-28T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/77a70af0-6fc2-4bb0-93cc-a95a9b60cabf.mp3","mime_type":"audio/mpeg","size_in_bytes":27960151,"duration_in_seconds":2551}]},{"id":"afd64b4c-1be6-4e9c-99c5-801309bcce5c","title":"Keep Your Data Lake Fresh With Real Time Streams Using Estuary","url":"https://www.dataengineeringpodcast.com/estuary-real-time-streaming-data-lake-episode-375","content_text":"Summary\n\nBatch vs. streaming is a long running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Estuary is and the story behind it?\nStream processing technologies have been around for around a decade. 
How would you characterize the current state of the ecosystem?\n\n\nWhat was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch?\n\nWith the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming?\n\n\nWhat is the comparative level of difficulty and support for these disparate paradigms?\n\nWhat is the impact of continuous data flows on dags/orchestration of transforms?\nWhat role do modern table formats have on the viability of real-time data lakes?\nCan you describe the architecture of your Flow platform?\n\n\nWhat are the core capabilities that you are optimizing for in its design?\n\nWhat is involved in getting Flow/Estuary deployed and integrated with an organization's data systems?\nWhat does the workflow look like for a team using Estuary?\n\n\nHow does it impact the overall system architecture for a data platform as compared to other prevalent paradigms?\n\nHow do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Estuary used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary?\nWhen is Estuary the wrong choice?\nWhat do you have planned for the future of Estuary?\n\n\nContact Info\n\n\nDave Y\nJohnny G\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nEstuary\nTry Flow Free\nGazette\nSamza\nFlink\n\n\nPodcast Episode\n\nStorm\nKafka Topic Partitioning\nTrino\nAvro\nParquet\nFivetran\n\n\nPodcast Episode\n\nAirbyte\nSnowflake\nBigQuery\nVector Database\nCDC == Change Data Capture\nDebezium\n\n\nPodcast Episode\n\nMapReduce\nNetflix DBLog\nJSON-Schema\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. 
You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Batch vs. streaming is a long-running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary has built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Batch vs. streaming is a long running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.","date_published":"2023-05-21T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/afd64b4c-1be6-4e9c-99c5-801309bcce5c.mp3","mime_type":"audio/mpeg","size_in_bytes":27204225,"duration_in_seconds":3350}]},{"id":"3d5ead92-3b43-4074-a6ea-dde9d47c4b92","title":"What Happens When The Abstractions Leak On Your Data","url":"https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374","content_text":"Summary\n\nAll of the advancements in our technology is based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observances on how to deal with that situation in a data platform architecture.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm sharing some thoughts and observances about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow\n\n\nInterview\n\n\nIntroduction\nimpact of community tech debt\n\n\nhive metastore\nnew work being done but not widely adopted\n\ntensions between automation and correctness\ndata type mapping\n\n\ninteger types\ncomplex types\nnaming things (keys/column names from APIs to databases)\n\ndisaggregated databases - pros and cons\n\n\nflexibility and cost control\nnot as much tooling invested vs. Snowflake/BigQuery/Redshift\n\ndata modeling\n\n\ndimensional modeling vs. answering today's questions\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform?\nWhen is ELT the wrong choice?\nWhat do you have planned for the future of your data platform?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\ndbt\nAirbyte\n\n\nPodcast Episode\n\nDagster\n\n\nPodcast Episode\n\nTrino\n\n\nPodcast Episode\n\nELT\nData Lakehouse\nSnowflake\nBigQuery\nRedshift\nTechnical Debt\nHive Metastore\nAWS Glue\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

All of the advancements in our technology are based on the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"All of the advancements in our technology is based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observances on how to deal with that situation in a data platform architecture.","date_published":"2023-05-14T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3d5ead92-3b43-4074-a6ea-dde9d47c4b92.mp3","mime_type":"audio/mpeg","size_in_bytes":19406661,"duration_in_seconds":1601}]},{"id":"ea1fa319-da05-443b-8729-5e66b6d4d345","title":"Use Consistent And Up To Date Customer Profiles To Power Your Business With Segment Unify","url":"https://www.dataengineeringpodcast.com/segment-unify-customer-profile-episode-episode-373","content_text":"Summary\n\nEvery business has customers, and a critical element of success is understanding who they are and how they are using the companies products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions and stitching that together is complex and time consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Segment Unify is and the story behind it? 
\nWhat are the net-new capabilities that it brings to the Segment product suite?\nWhat are some of the categories of attributes that need to be managed in a prototypical customer profile?\nWhat are the different use cases that are enabled/simplified by the availability of a comprehensive customer profile?\n\n\nWhat is the potential impact of more detailed customer profiles on LTV?\n\nHow do you manage permissions/auditability of updating or amending profile data?\nCan you describe how the Unify product is implemented?\n\n\nWhat are the technical challenges that you had to address while developing/launching this product?\n\nWhat is the workflow for a team who is adopting the Unify product?\n\n\nWhat are the other Segment products that need to be in use to take advantage of Unify?\n\nWhat are some of the most complex edge cases to address in identity resolution?\nHow does reverse ETL factor into the enrichment process for profile data?\nWhat are some of the issues that you have to account for in synchronizing profiles across platforms/products?\n\n\nHow do you mitigate the impact of \"regression to the mean\" for systems that don't support all of the attributes that you want to maintain in a profile record?\n\nWhat are some of the data modeling considerations that you have had to account for to support historical changes (e.g. slowly changing dimensions)?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Segment Unify used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Segment Unify?\nWhen is Segment Unify the wrong choice?\nWhat do you have planned for the future of Segment Unify?\n\n\nContact Info\n\n\nKevin\n\n\nLinkedIn\nBlog\n\nHanhan\n\n\nLinkedIn\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nSegment Unify\nSegment\n\n\nPodcast Episode\n\nCustomer Data Platform (CDP)\nGolden Profile\nReverse ETL\nMarTech Landscape\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. 
You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions, and stitching that together is complex and time-consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Every business has customers, and a critical element of success is understanding who they are and how they are using the companies products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions and stitching that together is complex and time consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.","date_published":"2023-05-07T09:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ea1fa319-da05-443b-8729-5e66b6d4d345.mp3","mime_type":"audio/mpeg","size_in_bytes":35225637,"duration_in_seconds":3274}]},{"id":"8c0728cc-95ea-4719-be6d-8bf8b976c3a0","title":"Realtime Data Applications Made Easier With Meroxa","url":"https://www.dataengineeringpodcast.com/meroxa-real-time-data-applications-episode-372","content_text":"Summary\n\nReal-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing DeVaris Brown about the impact of real-time data on business opportunities and risk profiles\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Meroxa is and the story behind it?\n\n\nHow have the focus and goals of the platform and company evolved over the past 2 years?\n\nWho are the target customers for Meroxa?\n\n\nWhat problems are they trying to solve when they come to your platform?\n\nApplications powered by real-time data were the exclusive domain of large and/or sophisticated tech companies for several years due to the inherent complexities involved. What are the shifts that have made them more accessible to a wider variety of teams?\n\n\nWhat are some of the remaining blockers for teams who want to start using real-time data?\n\nWith the democratization of real-time data, what are the new categories of products and applications that are being unlocked?\n\n\nHow are organizations thinking about the potential value that those types of apps/services can provide?\n\nWith data flowing constantly, there are new challenges around oversight and accuracy. 
How does real-time data change the risk profile for applications that are consuming it?\n\n\nWhat are some of the technical controls that are available for organizations that are risk-averse?\n\nWhat skills do developers need to be able to effectively design, develop, and deploy real-time data applications?\n\n\nHow does this differ when talking about internal vs. consumer/end-user facing applications?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Meroxa used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Meroxa?\nWhen is Meroxa the wrong choice?\nWhat do you have planned for the future of Meroxa?\n\n\nContact Info\n\n\nLinkedIn\n@devarispbrown on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nMeroxa\n\n\nPodcast Episode\n\nKafka\nKafka Connect\nConduit - golang Kafka connect replacement\nPulsar\nRedpanda\nFlink\nBeam\nClickhouse\nDruid\nPinot\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.","date_published":"2023-04-23T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8c0728cc-95ea-4719-be6d-8bf8b976c3a0.mp3","mime_type":"audio/mpeg","size_in_bytes":30778453,"duration_in_seconds":2726}]},{"id":"60c936b3-cf38-4d12-b6ea-862d8b1262e8","title":"Building Self Serve Business Intelligence With AI And Semantic Modeling At Zenlytic","url":"https://www.dataengineeringpodcast.com/zenlytic-self-serve-business-intelligence-episode-371","content_text":"Summary\n\nBusiness intellingence has been chasing the promise of self-serve data for decades. As the capabilities of these systems has improved and become more accessible, the target of what self-serve means changes. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Paul Blankley and Ryan Janssen about Zenlytic, a no-code business intelligence tool focused on emerging commerce brands\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Zenlytic is and the story behind it?\nBusiness intelligence is a crowded market. What was your process for defining the problem you are focused on solving and the method to achieve that outcome?\nSelf-serve data exploration has been attempted in myriad ways over successive generations of BI and data platforms. 
What are the barriers that have been the most challenging to overcome in that effort?\n\n\nWhat are the elements that are coming together now that give you confidence in being able to deliver on that?\n\nCan you describe how Zenlytic is implemented?\n\n\nWhat are the evolutions in the understanding and implementation of semantic layers that provide a sufficient substrate for operating on?\nHow have the recent breakthroughs in large language models (LLMs) improved your ability to build features in Zenlytic?\nWhat is your process for adding domain semantics to the operational aspect of your LLM?\n\nFor someone using Zenlytic, what is the process for getting it set up and integrated with their data?\nOnce it is operational, can you describe some typical workflows for using Zenlytic in a business context?\n\n\nWho are the target users?\nWhat are the collaboration options available?\n\nWhat are the most complex engineering/data challenges that you have had to address in building Zenlytic?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Zenlytic used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Zenlytic?\nWhen is Zenlytic the wrong choice?\nWhat do you have planned for the future of Zenlytic?\n\n\nContact Info\n\n\nPaul Blankley (LinkedIn)\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nZenlytic\nOLAP Cube\nLarge Language Model\nStarburst\nPrompt Engineering\nChatGPT\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

Business intelligence has been chasing the promise of self-serve data for decades. As the capabilities of these systems have improved and become more accessible, the target of what self-serve means keeps changing. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural-language-driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Business intellingence has been chasing the promise of self-serve data for decades. As the capabilities of these systems has improved and become more accessible, the target of what self-serve means changes. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.","date_published":"2023-04-16T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/60c936b3-cf38-4d12-b6ea-862d8b1262e8.mp3","mime_type":"audio/mpeg","size_in_bytes":27951520,"duration_in_seconds":2959}]},{"id":"511d188c-e626-4b34-afa2-1924458483c4","title":"An Exploration Of The Composable Customer Data Platform","url":"https://www.dataengineeringpodcast.com/composable-cdp-at-autotrader-episode-370","content_text":"Summary\n\nThe customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack\nYour host is Tobias Macey and today I'm interviewing Darren Haken and Tejas Manohar about building a composable CDP and how you can start adopting it incrementally\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you mean by a \"composable CDP\"?\n\n\nWhat are some of the key ways that it differs from the ways that we think of a CDP today?\n\nWhat are the problems that you were focused on addressing at Autotrader that are solved by a CDP?\nOne of the promises of the first generation CDP was an opinionated way to model your data so that non-technical teams could own this responsibility. 
What do you see as the risks/tradeoffs of moving CDP functionality into the same data stack as the rest of the organization?\n\n\nWhat about companies that don't have the capacity to run a full data infrastructure?\n\nBeyond the core technology of the data warehouse, what are the other evolutions/innovations that allow for a CDP experience to be built on top of the core data stack?\nadded burden on core data teams to generate event-driven data models\nWhen iterating toward a CDP on top of the core investment of the infrastructure to feed and manage a data warehouse, what are the typical first steps?\n\n\nWhat are some of the components in the ecosystem that help to speed up the time to adoption? (e.g. pre-built dbt packages for common transformations, etc.)\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen CDPs implemented?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on CDP related functionality?\nWhen is a CDP (composable or monolithic) the wrong choice?\nWhat do you have planned for the future of the CDP stack?\n\n\nContact Info\n\n\nDarren\n\n\nLinkedIn\n@DarrenHaken on Twitter\n\nTejas\n\n\nLinkedIn\n@tejasmanohar on Twitter\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nAutotrader\nHightouch\n\n\nCustomer Studio\n\nCDP == Customer Data Platform\nSegment\n\n\nPodcast Episode\n\nmParticle\nSalesforce\nAmplitude\nSnowplow\n\n\nPodcast Episode\n\nReverse ETL\ndbt\n\n\nPodcast Episode\n\nSnowflake\n\n\nPodcast Episode\n\nBigQuery\nDatabricks\nELT\nFivetran\n\n\nPodcast Episode\n\nDataHub\n\n\nPodcast Episode\n\nAmundsen\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"

Summary

\n\n

The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation, it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage, a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack.","date_published":"2023-04-09T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/511d188c-e626-4b34-afa2-1924458483c4.mp3","mime_type":"audio/mpeg","size_in_bytes":42502655,"duration_in_seconds":4302}]},{"id":"72071186-5b5a-4132-b546-29261599d3bf","title":"Mapping The Data Infrastructure Landscape As A Venture Capitalist","url":"https://www.dataengineeringpodcast.com/mad-landscape-2023-data-infrastructure-episode-369","content_text":"Summary\n\nThe data ecosystem has been building momentum for several years now. As a venture capital investor Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nBusinesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more\nYour host is Tobias Macey and today I'm interviewing Matt Turck about his annual report on the Machine Learning, AI, & Data landscape and the insights around data infrastructure that he has gained in the process\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the MAD landscape report is and the story behind it?\n\n\nAt a high level, what is your goal in the compilation and maintenance of your landscape document?\nWhat are your guidelines for what to include in the landscape?\n\nAs the data landscape matures, how have you seen that influence the types of projects/companies that are founded?\n\n\nWhat are the product categories that were only viable when capital was plentiful and easy to obtain?\nWhat are the product categories that you think will be swallowed by adjacent concerns, and which are likely to consolidate to remain competitive?\n\nThe rapid growth and proliferation of data tools helped establish the \"Modern Data Stack\" as a de-facto architectural paradigm. 
As we move into this phase of contraction, what are your predictions for how the \"Modern Data Stack\" will evolve?\n\n\nIs there a different architectural paradigm that you see as growing to take its place?\n\nHow has your presentation and the types of information that you collate in the MAD landscape evolved since you first started it?\nWhat are the most interesting, innovative, or unexpected product and positioning approaches that you have seen while tracking data infrastructure as a VC and maintainer of the MAD landscape?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the MAD landscape over the years?\nWhat do you have planned for future iterations of the MAD landscape?\n\n\nContact Info\n\n\nWebsite\n@mattturck on Twitter\nMAD Landscape Comments Email\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nMAD Landscape\nFirst Mark Capital\nBayesian Learning\nAI Winter\nDatabricks\nCloud Native Landscape\nLUMA Scape\nHadoop Ecosystem\nModern Data Stack\nReverse ETL\nGenerative AI\ndbt\nTransform\n\n\nPodcast Episode\n\nSnowflake IPO\nDataiku\nIceberg\n\n\nPodcast Episode\n\nHudi\n\n\nPodcast Episode\n\nDuckDB\n\n\nPodcast Episode\n\nTrino\nY42\n\n\nPodcast Episode\n\nMozart Data\n\n\nPodcast Episode\n\nKeboola\nMPP Database\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\nBusinesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit [RudderStack.com/DEP](https://rudderstack.com/dep) to learn more","content_html":"

Summary

\n\n

The data ecosystem has been building momentum for several years now. As a venture capital investor, Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The data ecosystem has been building momentum for several years now. As a venture capital investor Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.","date_published":"2023-04-02T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/72071186-5b5a-4132-b546-29261599d3bf.mp3","mime_type":"audio/mpeg","size_in_bytes":29551548,"duration_in_seconds":3717}]},{"id":"f84206f5-2488-4657-879f-c3eee3ba597c","title":"Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite","url":"https://www.dataengineeringpodcast.com/grainite-streaming-data-application-platform-episode-368","content_text":"Summary\n\nThe promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nBusinesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more\nHey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.\nJoin in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. 
Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today\nYour host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Grainite is and the story behind it?\nWhat are the personas that you are focused on addressing with Grainite?\nWhat are some of the most complex aspects of building streaming data applications in the absence of something like Grainite?\n\n\nHow does Grainite work to reduce that complexity?\n\nWhat are some of the commonalities that you see in the teams/organizations that find their way to Grainite?\nWhat are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture?\nCan you describe how Grainite is architected?\n\n\nHow have the design and goals of the platform changed/evolved since you first started working on it?\n\nWhat does your internal build vs. buy process look like for identifying where to spend your engineering resources?\nWhat is the process for getting Grainite set up and integrated into an organizations technical environment?\n\n\nWhat is your process for determining which elements of the platform to expose as end-user features and customization options vs. keeping internal to the operational aspects of the product?\n\nOnce Grainite is running, can you describe the day 0 workflow of building an application or data flow?\n\n\nWhat are the day 2 - N capabilities that Grainite offers for ongoing maintenance/operation/evolution of those applications?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Grainite used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite?\nWhen is Grainite the wrong choice?\nWhat do you have planned for the future of Grainite?\n\n\nContact Info\n\n\nAshish\n\n\nLinkedIn\n\nAbhishek\n\n\nLinkedIn\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nGrainite\n\n\nBlog about the challenges of streaming architectures\nGetting Started Docs\n\nBigTable\nSpanner\nFirestore\nOpenCensus\nCitrix\nNetScaler\nJ2EE\nRocksDB\nPulsar\nSQL Server\nMySQL\nRAFT Protocol\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\nJoin us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) Promo Code: dataengpod20Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\nBusinesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit [RudderStack.com/DEP](https://rudderstack.com/dep) to learn moreTimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png)\r\nTimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.\r\n\r\nYou can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.\r\n\r\nGo to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!","content_html":"

Summary

\n\n

The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.","date_published":"2023-03-25T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f84206f5-2488-4657-879f-c3eee3ba597c.mp3","mime_type":"audio/mpeg","size_in_bytes":47888123,"duration_in_seconds":4413}]},{"id":"02f26ef7-877b-4733-a4c0-94c841f36c72","title":"Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed","url":"https://www.dataengineeringpodcast.com/satori-data-security-platform-data-productivity-episode-367","content_text":"Summary\n\nAs with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nJoin in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today\nRudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder\nHey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. 
By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.\nYour host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nData security is a very broad term. Can you start by enumerating some of the different concerns that are involved?\nHow has the scope and complexity of implementing security controls on data systems changed in recent years?\n\n\nIn your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?\n\nWhat are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?\n\n\nHow much of the problem is technical vs. procedural/organizational?\n\nAs a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)\n\n\nWhat are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)\n\nWhat are some of the ways that data security and organizational productivity are at odds with each other?\n\n\nWhat are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?\n\nWhat are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?\n\n\nHow does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use?\nHow can education about the motivations for different security practices improve compliance and user experience?\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology?\nWhat are the areas of data security that still need improvements?\n\n\nContact Info\n\n\nYoav Cohen\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nSatori\n\n\nPodcast Episode\n\nData Masking\nRBAC == Role Based Access Control\nABAC == Attribute Based Access Control\nGartner Data Security Platform Report\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\nBusinesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit [RudderStack.com/DEP](https://rudderstack.com/dep) to learn moreData Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\nJoin us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) Promo Code: dataengpod20TimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png)\r\nTimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.\r\n\r\nYou can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.\r\n\r\nGo to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!","content_html":"

Summary

\n\n

As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.","date_published":"2023-03-18T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/02f26ef7-877b-4733-a4c0-94c841f36c72.mp3","mime_type":"audio/mpeg","size_in_bytes":27411989,"duration_in_seconds":3098}]},{"id":"f64561d8-46b0-4a75-8abe-7f1480d825ce","title":"Use Your Data Warehouse To Power Your Product Analytics With NetSpring","url":"https://www.dataengineeringpodcast.com/netspring-data-warehouse-product-analytics-episode-366","content_text":"Summary\n\nWith the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nJoin in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. 
Sign up free at dataengineeringpodcast.com/rudder\nYour host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what NetSpring is and the story behind it?\n\n\nWhat are the activities that constitute \"product analytics\" and what are the roles/teams involved in those activities?\n\nWhen teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ?\nCan you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries?\n\n\nHow does a warehouse-native approach simplify that effort?\n\nThere are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem?\nHow is the NetSpring platform implemented to be able to best take advantage of modern warehouse technologies and the associated data stacks?\n\n\nWhat are the pre-requisites for an organization's infrastructure/data maturity for being able to benefit from NetSpring?\nHow have the goals and implementation of the NetSpring platform evolved from when you first started working on it?\n\nCan you describe the steps involved in integrating NetSpring with an organization's existing warehouse?\n\n\nWhat are the signals that NetSpring uses to understand the customer journeys of different organizations?\nHow do you manage the variance of the data models in the warehouse while providing a consistent experience for your users?\n\nGiven that you are a product organization, how are you using NetSpring to power NetSpring?\nWhat are the most interesting, innovative, or unexpected ways that you have seen NetSpring used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on NetSpring?\nWhen is NetSpring the wrong choice?\nWhat do you have planned for the future of NetSpring?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nNetSpring\nThoughtSpot\nProduct Analytics\nAmplitude\nMixpanel\nCustomer Data Platform\nGDPR\nCCPA\nSegment\n\n\nPodcast Episode\n\nRudderstack\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:TimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png)\r\nTimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. 
TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.\r\n\r\nYou can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.\r\n\r\nGo to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png)\r\nJoin us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) Promo Code: dataengpod20","content_html":"

Summary

\n\n

With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
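As a rough illustration of what "warehouse-native" product analytics means in practice, here is a self-contained Python sketch that runs a funnel query directly against an event table. It uses SQLite only so the example is runnable; the schema and SQL are invented for the example and are not NetSpring's data model.

```python
# Generic illustration of a "warehouse-native" funnel: the event stream already
# lives in the warehouse, so the analysis is just SQL against it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event TEXT, ts TEXT);
    INSERT INTO events VALUES
        ('u1', 'signup',   '2023-03-01'),
        ('u1', 'purchase', '2023-03-02'),
        ('u2', 'signup',   '2023-03-01');
""")

funnel = conn.execute("""
    SELECT
        COUNT(DISTINCT s.user_id) AS signed_up,
        COUNT(DISTINCT p.user_id) AS purchased
    FROM events s
    LEFT JOIN events p
        ON p.user_id = s.user_id
        AND p.event = 'purchase'
        AND p.ts >= s.ts
    WHERE s.event = 'signup'
""").fetchone()

print(dict(zip(["signed_up", "purchased"], funnel)))
# -> {'signed_up': 2, 'purchased': 1}
```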

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.","date_published":"2023-03-10T08:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f64561d8-46b0-4a75-8abe-7f1480d825ce.mp3","mime_type":"audio/mpeg","size_in_bytes":32692573,"duration_in_seconds":2961}]},{"id":"f6968b67-ac04-4e05-8738-e3efa591b73e","title":"Exploring The Nuances Of Building An Intentional Data Culture","url":"https://www.dataengineeringpodcast.com/data-council-data-culture-track-episode-365","content_text":"Summary\n\nThe ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nHey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.\nYour host is Tobias Macey and today I'm interviewing Pete Soderling and Maggie Hays about the growing importance of establishing and investing in an organization's data culture and their experience forming an entire conference track around this topic\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your working definition of \"Data Culture\" is?\n\n\nIn what ways is a data culture distinct from an organization's corporate culture? 
How are they interdependent?\nWhat are the elements that are most impactful in forming the data culture of an organization?\n\nWhat are some of the motivations that teams/companies might have in fighting against the creation and support of an explicit data culture?\n\n\nAre there any strategies that you have found helpful in counteracting those tendencies?\n\nIn terms of the conference, what are the factors that you consider when deciding how to group the different presentations into tracks or themes?\n\n\nWhat are the experiences that you have had personally and in community interactions that led you to elevate data culture to be it's own track?\n\nWhat are the broad challenges that practitioners are facing as they develop their own understanding of what constitutes a healthy and productive data culture?\nWhat are some of the risks that you considered when forming this track and evaluating proposals?\nWhat are your criteria for determining whether this track is successful?\nWhat are the most interesting, innovative, or unexpected aspects of data culture that you have encountered through developing this track?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on selecting presentations for this year's event?\nWhat do you have planned for the future of this topic at Data Council events?\n\n\nContact Info\n\n\nPete\n\n\n@petesoder on Twitter\nLinkedIn\n\nMaggie\n\n\nLinkedIn\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nData Council\n\n\nPodcast Episode\n\nData Community Fund\nDataHub\n\n\nPodcast Episode\n\nDatabase Design For Mere Mortals by Michael J. Hernandez (affiliate link)\nSOAP\nREST\nEconometrics\nDBA == Database Administrator\nConway's Law\ndbt\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:TimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png)\r\nTimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.\r\n\r\nYou can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. 
We do this for one simple reason: because time matters.\r\n\r\nGo to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!","content_html":"

Summary

\n\n

The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing, it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.","date_published":"2023-03-05T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f6968b67-ac04-4e05-8738-e3efa591b73e.mp3","mime_type":"audio/mpeg","size_in_bytes":28993677,"duration_in_seconds":2744}]},{"id":"1b81b395-64ca-4c76-b018-a39a8c6837bf","title":"Building A Data Mesh Platform At PayPal","url":"https://www.dataengineeringpodcast.com/building-a-data-mesh-at-paypal-episode-364","content_text":"Summary\n\nThere has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nAre you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.\nYour host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the goals and scope of your work at PayPal to implement a data mesh?\n\n\nWhat are the core problems that you were addressing with this project?\nIs a data mesh ever \"done\"?\n\nWhat was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration?\nWhat was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh?\nWhat are the technical systems that you are relying on to power the different data domains?\n\n\nWhat is your philosophy on enforcing uniformity in technical systems vs. 
relying on interface definitions as the unit of consistency?\n\nWhat are the biggest challenges (technical and procedural) that you have encountered during your implementation?\nHow are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.)\nWhat are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh?\nWhen is a data mesh the wrong choice? \nWhat do you have planned for the future of your data mesh at PayPal?\n\n\nContact Info\n\n\nLinkedIn\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nData Mesh\n\n\nO'Reilly Book (affiliate link)\n\nThe next generation of Data Platforms is the Data Mesh\nPayPal\nConway's Law\nData Mesh For All Ages - US, Data Mesh For All Ages - UK\nData Mesh Radio\nData Mesh Community\nData Mesh In Action\nGreat Expectations\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:TimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png)\r\nTimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible.\r\n\r\nYou can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters.\r\n\r\nGo to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!","content_html":"

Summary

\n\n

There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.
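Since data contracts come up repeatedly in this conversation, here is a deliberately simplified Python sketch of what a contract check between a producing and a consuming domain can look like. The contract format and the validate helper are invented for the example and are not PayPal's implementation.

```python
# Minimal sketch of a data contract check: the producing domain publishes a
# schema, and consumers validate records against it before relying on them.
# The contract format is invented for this example.

CONTRACT = {
    "dataset": "payments.transactions",
    "fields": {
        "transaction_id": str,
        "amount_usd": float,
        "created_at": str,  # ISO-8601 timestamp
    },
}


def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for field, expected_type in contract["fields"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors


print(validate(
    {"transaction_id": "t-1", "amount_usd": "12.50", "created_at": "2023-02-26T00:00:00Z"},
    CONTRACT,
))
# -> ['wrong type for amount_usd: expected float']
```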

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.","date_published":"2023-02-26T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1b81b395-64ca-4c76-b018-a39a8c6837bf.mp3","mime_type":"audio/mpeg","size_in_bytes":32508291,"duration_in_seconds":2814}]},{"id":"cfd6d5fd-6bcf-4938-bfd3-551f92bfdc9d","title":"The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse","url":"https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363","content_text":"Summary\n\nCloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nHey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.\nYour host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?\n\n\nSince it is a fundamentally a specification, how do you manage compatibility and consistency across implementations?\n\nWhat are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation October of 2018?\nAround the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. 
What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?\n\n\nGiven the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?\n\nFor someone who wants to manage their data in Iceberg tables, what does the implementation look like?\n\n\nHow does that change based on the type of query/processing engine being used?\n\nOnce a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?\nWhen is Iceberg/Tabular the wrong choice?\nWhat do you have planned for the future of Iceberg/Tabular?\n\n\nContact Info\n\n\nLinkedIn\nrdblue on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nIceberg\n\n\nPodcast Episode\n\nHadoop\nData Lakehouse\nACID == Atomic, Consistent, Isolated, Durable\nApache Hive\nApache Impala\nBodo\n\n\nPodcast Episode\n\nStarRocks\nDremio\n\n\nPodcast Episode\n\nDDL == Data Definition Language\nTrino\nPrestoDB\nApache Hudi\n\n\nPodcast Episode\n\ndbt\nApache Flink\nTileDB\n\n\nPodcast Episode\n\nCDC == Change Data Capture\nSubstrait\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Acryl: ![Acryl](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/2E3zCRd4.png)\r\n\r\nThe modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Signup for the SaaS product today at <u>[dataengineeringpodcast.com/acryl](https://www.dataengineeringpodcast.com/acryl)</u>","content_html":"

Summary

\n\n

Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg offer a viable alternative in the form of data lakehouses that combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.
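For listeners who want to see what working with Iceberg tables looks like, here is a minimal PySpark sketch based on the Iceberg quickstart configuration. The package coordinate, catalog name, and warehouse path are placeholders and should be adjusted to match your Spark and Iceberg versions.

```python
# Minimal sketch of creating and querying an Iceberg table from PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Assumed package coordinate; pick the runtime that matches your Spark version.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Iceberg tables are created like any other Spark SQL table, just "USING iceberg".
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'hello'), (2, 'world')")

# Snapshot metadata is exposed through metadata tables, which is what enables
# features like time travel and incremental reads.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```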

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.","date_published":"2023-02-19T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cfd6d5fd-6bcf-4938-bfd3-551f92bfdc9d.mp3","mime_type":"audio/mpeg","size_in_bytes":35441567,"duration_in_seconds":3306}]},{"id":"4a0a6f4f-16db-4ea7-9795-9125e48f4085","title":"Let The Whole Team Participate In Data With The Quilt Versioned Data Hub","url":"https://www.dataengineeringpodcast.com/quilt-data-versioned-data-hub-episode-362","content_text":"Summary\n\nData is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been \"by developers, for developers\", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nTruly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. 
If you like what you see and want to help make it better, they're hiring across all functions!\nYour host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Quilt is and the story behind it?\n\n\nHow have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?\n\nWhat are the main problems that users are trying to solve when they find Quilt?\n\n\nWhat are some of the alternative approaches/products that they are coming from?\n\nHow does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.?\nCan you describe how Quilt is implemented?\nWhat are the types of tools and systems that Quilt gets integrated with?\n\n\nHow do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?\n\nWhat is a typical workflow for a team that is using Quilt to manage their data?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Quilt used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt?\nWhen is Quilt the wrong choice?\nWhat do you have planned for the future of Quilt?\n\n\nContact Info\n\n\nLinkedIn\n@akarve on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nQuilt Data\n\n\nPodcast Episode\n\nUW Madison\nDocker Swarm\nKaggle\nopen.quiltdata.com\nFinOS Perspective\nLakeFS\n\n\nPodcast Episode\n\nPachyderm\n\n\nPodcast Episode\n\nUnstruk\n\n\nPodcast Episode\n\nParquet\nAvro\nORC\nCloudformation\nTroposphere\nCDK == Cloud Development Kit\nShadow IT\n\n\nPodcast Episode\n\nDelta Lake\n\n\nPodcast Episode\n\nApache Iceberg\n\n\nPodcast Episode\n\nDatasette\nFrictionless\nDVC\n\n\nPodcast.__init__ Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nLooking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. \r\n\r\nMaterialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. 
Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.\r\n\r\nSign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u>","content_html":"

Summary

\n\n

Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
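As a small taste of the workflow discussed here, the sketch below uses the quilt3 Python client to build and push a versioned package to S3. The bucket, package name, and local file are placeholders, and the exact call signatures should be confirmed against the Quilt documentation.

```python
# Sketch of publishing a versioned data package with the quilt3 client.
# Assumes "example.csv" exists locally and that AWS credentials with access to
# the target bucket are configured; all names are placeholders.
import quilt3

pkg = quilt3.Package()
pkg.set("data/example.csv", "example.csv")  # logical key -> local file
pkg.set_meta({"description": "Example dataset for illustration"})

# Each push creates a new, addressable revision of the package.
pkg.push("myteam/example-dataset",
         registry="s3://my-example-bucket",
         message="initial version")

# Later, anyone on the team can browse that same versioned package.
remote = quilt3.Package.browse("myteam/example-dataset",
                               registry="s3://my-example-bucket")
for logical_key, entry in remote.walk():
    print(logical_key)
```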

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been \"by developers, for developers\", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.","date_published":"2023-02-11T13:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4a0a6f4f-16db-4ea7-9795-9125e48f4085.mp3","mime_type":"audio/mpeg","size_in_bytes":28393407,"duration_in_seconds":3122}]},{"id":"8b3b1967-11ce-4d33-b077-ff124849749e","title":"Reflecting On The Past 6 Years Of Data Engineering","url":"https://www.dataengineeringpodcast.com/six-year-retrospective-episode-361","content_text":"Summary\n\nThis podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYour host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years\n\n\nInterview\n\n\nIntroduction\n6 years of running the Data Engineering Podcast\nAround the first time that data engineering was discussed as a role\n\n\nFollowed on from hype about \"data science\"\n\nHadoop era\nStreaming\nLambda and Kappa architectures\n\n\nNot really referenced anymore\n\n\"Big Data\" era of capture everything has shifted to focusing on data that presents value\n\n\nRegulatory environment increases risk, better tools introduce more capability to understand what data is useful\n\nData catalogs\n\n\nAmundsen and Alation\n\nOrchestration engine\n\n\nOozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc.\nOrchestration is now a part of most vertical tools\n\nCloud data warehouses\nData lakes\nDataOps and MLOps\nData quality to data observability\nMetadata for everything\n\n\nData catalog -> data discovery -> active metadata\n\nBusiness intelligence\n\n\nRead only reports to metric/semantic layers\nEmbedded analytics and data APIs\n\nRise of ELT\n\n\ndbt\nCorresponding introduction of reverse ETL\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast?\nWhat do you have planned for the future of the podcast?\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nLooking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. \r\n\r\nMaterialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.\r\n\r\nSign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u>","content_html":"

Summary

\n\n

This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.","date_published":"2023-02-05T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8b3b1967-11ce-4d33-b077-ff124849749e.mp3","mime_type":"audio/mpeg","size_in_bytes":23430473,"duration_in_seconds":1941}]},{"id":"daad3acc-a360-4af1-888f-f28296014c30","title":"Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics","url":"https://www.dataengineeringpodcast.com/omni-analytics-automated-business-intelligence-episode-360","content_text":"Summary\n\nBusiness intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nTruly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!\nYour host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Omni Analytics is and the story behind it?\n\n\nWhat are the core goals that you are trying to achieve with building Omni?\n\nBusiness intelligence has gone through many evolutions. 
What are the unique capabilities that Omni Analytics offers over other players in the market?\n\n\nWhat are the technical and organizational anti-patterns that typically grow up around BI systems?\n\nWhat are the elements that contribute to BI being such a difficult product to use effectively in an organization?\nCan you describe how you have implemented the Omni platform?\n\n\nHow have the design/scope/goals of the product changed since you first started working on it?\n\nWhat does the workflow for a team using Omni look like?\nWhat are some of the developments in the broader ecosystem that have made your work possible?\nWhat are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Omni used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?\nWhen is Omni the wrong choice?\nWhat do you have planned for the future of Omni?\n\n\nContact Info\n\n\nLinkedIn\n@cmerrick on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nOmni Analytics\nStitch\nRJ Metrics\nLooker\n\n\nPodcast Episode\n\nSinger\ndbt\n\n\nPodcast Episode\n\nTeradata\nFivetran\nApache Arrow\n\n\nPodcast Episode\n\nDuckDB\n\n\nPodcast Episode\n\nBigQuery\nSnowflake\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nLooking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. \r\n\r\nMaterialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.\r\n\r\nSign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u>","content_html":"

Summary

\n\n

Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that the business uses to understand and direct its operations, but the process is very labor- and time-intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.
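To illustrate the general idea of deriving a model from the queries that are actually executed, here is a toy Python sketch that accumulates the columns referenced per table from a query log. The parsing and the resulting model format are invented for the example and are not Omni's implementation.

```python
# Toy illustration of deriving a reusable "model" from observed queries.
import re
from collections import defaultdict

QUERY_LOG = [
    "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id",
    "SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id",
]


def derive_model(queries):
    """Accumulate the expressions selected per table into a crude model definition."""
    model = defaultdict(set)
    for q in queries:
        table = re.search(r"FROM\s+(\w+)", q, re.IGNORECASE).group(1)
        select_list = re.search(r"SELECT\s+(.*?)\s+FROM", q, re.IGNORECASE).group(1)
        for expr in select_list.split(","):
            model[table].add(expr.strip())
    return {table: sorted(cols) for table, cols in model.items()}


print(derive_model(QUERY_LOG))
# -> {'orders': ['COUNT(*) AS order_count', 'SUM(amount) AS revenue', 'customer_id']}
```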

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.","date_published":"2023-01-29T20:45:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/daad3acc-a360-4af1-888f-f28296014c30.mp3","mime_type":"audio/mpeg","size_in_bytes":26812559,"duration_in_seconds":3043}]},{"id":"27fb653e-7a50-4ea1-978e-27c1b7f728d3","title":"Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI","url":"https://www.dataengineeringpodcast.com/tonic-ai-fake-data-generation-episode-359","content_text":"Summary\n\nThe most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nTruly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!\nData and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. 
Go to dataengineeringpodcast.com/gartnerda today to find out more.\nYour host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Tonic is and the story behind it?\nWhat are the core problems that you are trying to solve?\nWhat are some of the ways that fake or obfuscated data is used in development and analytics workflows?\nchallenges of reliably subsetting data\n\n\nimpact of ORMs and bad habits developers get into with database modeling\n\nCan you describe how Tonic is implemented?\n\n\nWhat are the units of composition that you are building to allow for evolution and expansion of your product?\nHow have the design and goals of the platform evolved since you started working on it?\n\nCan you describe some of the different workflows that customers build on top of your various tools\nWhat are the most interesting, innovative, or unexpected ways that you have seen Tonic used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic?\nWhen is Tonic the wrong choice?\nWhat do you have planned for the future of Tonic?\n\n\nContact Info\n\n\nLinkedIn\n@AdamKamor on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nTonic\n\n\nDjinn\n\nDjango\nRuby on Rails\nC#\nEntity Framework\nPostgreSQL\nMySQL\nOracle DB\nMongoDB\nParquet\nDatabricks\nMockaroo\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nLooking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. \r\n\r\nMaterialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. 
Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.\r\n\r\nSign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u>Gartner: ![Gartner](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4ODnKDqa.jpg)\r\n\r\nThe evolving business landscape continues to create challenges and opportunities for data and analytics (D&A) leaders — shifting away from focusing solely on tools and technology to decision making as a business competency. D&A teams are now in a better position than ever to help lead this change within the organization.\r\n\r\n \r\n\r\nHarnessing the full power of D&A today requires D&A leaders to guide their teams with purpose and scale their scope beyond organizational silos as companies push to transform and accelerate their data-driven strategies. Gartner Data & Analytics Summit 2023 addresses the most significant challenges D&A leaders face while navigating disruption and building the adaptable, innovative organizations this shifting environment demands.\r\n\r\nGo to <u>[dataengineeringpodcast.com/gartnerda](https://www.dataengineeringpodcast.com/gartnerda)</u> Listeners can save $375 off standard rates with code GARTNERDA Promo Code: GartnerDA","content_html":"

Summary

The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.","date_published":"2023-01-22T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/27fb653e-7a50-4ea1-978e-27c1b7f728d3.mp3","mime_type":"audio/mpeg","size_in_bytes":27566987,"duration_in_seconds":2740}]},{"id":"ca6871a4-20a5-4e05-8865-d99d2ad185d8","title":"Building Applications With Data As Code On The DataOS","url":"https://www.dataengineeringpodcast.com/dataos-modern-data-company-episode-358","content_text":"Summary\n\nThe modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nTruly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! 
Visit dataengineeringpodcast.com/montecarlo to learn more.\nData and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.\nYour host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your mission at The Modern Data Company is and the story behind it?\nYour flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform?\n\n\nWho is the target audience?\n\nOn your site you refer to the idea of \"data as software\". What are the principles and ways of thinking that are encompassed by that concept?\n\n\nWhat are the platform capabilities that are required to make it possible?\n\nThere are 11 \"Key Features\" listed on your site for the DataOS. What was your process for identifying the \"must have\" vs \"nice to have\" features for launching the platform?\nCan you describe the technical architecture that powers your DataOS product?\n\n\nWhat are the core principles that you are optimizing for in the design of your platform?\nHow have the design and goals of the system changed or evolved since you started working on DataOS?\n\nCan you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS?\nWhat are the interfaces and escape hatches that are available for integrating with and extending the operation of the DataOS?\nWhat are the features or capabilities that you are expressly choosing not to implement? (e.g. ML pipelines, data sharing, etc.)\nWhat are the design elements that you are focused on to make DataOS approachable and understandable by different members of an organization?\nWhat are the most interesting, innovative, or unexpected ways that you have seen DataOS used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on DataOS?\nWhen is DataOS the wrong choice?\nWhat do you have planned for the future of DataOS?\n\n\nContact Info\n\n\nLinkedIn\n@srujanakula on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nModern Data Company\nAlation\nAirbyte\n\n\nPodcast Episode\n\nFivetran\n\n\nPodcast Episode\n\nAirflow\nDremio\n\n\nPodcast Episode\n\nPrestoDB\nGraphQL\nCypher graph query language\nGremlin graph query language\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Gartner: ![Gartner](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4ODnKDqa.jpg)\r\n\r\nThe evolving business landscape continues to create challenges and opportunities for data and analytics (D&A) leaders — shifting away from focusing solely on tools and technology to decision making as a business competency. D&A teams are now in a better position than ever to help lead this change within the organization.\r\n\r\n \r\n\r\nHarnessing the full power of D&A today requires D&A leaders to guide their teams with purpose and scale their scope beyond organizational silos as companies push to transform and accelerate their data-driven strategies. Gartner Data & Analytics Summit 2023 addresses the most significant challenges D&A leaders face while navigating disruption and building the adaptable, innovative organizations this shifting environment demands.\r\n\r\nGo to <u>[dataengineeringpodcast.com/gartnerda](https://www.dataengineeringpodcast.com/gartnerda)</u> Listeners can save $375 off standard rates with code GARTNERDA Promo Code: GartnerDAMonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png)\r\n\r\nStruggling with broken pipelines? Stale dashboards? Missing data?\r\n\r\nIf this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!\r\n\r\nTrusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!\r\n\r\nVisit <u>[dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo)</u> to learn more.Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nLooking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. \r\n\r\nMaterialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. 
Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.\r\n\r\nSign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u>","content_html":"

Summary

The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.","date_published":"2023-01-15T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ca6871a4-20a5-4e05-8865-d99d2ad185d8.mp3","mime_type":"audio/mpeg","size_in_bytes":28436080,"duration_in_seconds":2916}]},{"id":"47611967-fa3a-4c9c-9a6f-8f246ed484b6","title":"Automate Your Pipeline Creation For Streaming Data Transformations With SQLake","url":"https://www.dataengineeringpodcast.com/sqlake-automatic-sql-pipelines-episode-357","content_text":"Summary\n\nManaging end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transforations in a unified SQL interface.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nData and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.\nTruly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. 
Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nYour host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the SQLake product is and the story behind it?\n\n\nWhat is the core problem that you are trying to solve?\n\nWhat are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airlow?\nWhat are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)?\nCan you describe the technical implementation of the SQLake feature?\nWhat does the workflow look like for designing and deploying pipelines in SQLake?\nWhat are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales?\n\n\nSQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?\n\nWhat are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales?\nWhat are some of the edge cases that you have had to provide escape hatches for?\nWhat are the most interesting, innovative, or unexpected ways that you have seen SQLake used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLake?\nWhen is SQLake the wrong choice?\nWhat do you have planned for the future of SQLake?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nUpsolver\n\n\nPodcast Episode\n\nSQLake\nAirflow\nDagster\n\n\nPodcast Episode\n\nPrefect\n\n\nPodcast Episode\n\nFlyte\n\n\nPodcast Episode\n\nGitHub Actions\ndbt\n\n\nPodcast Episode\n\nPartiQL\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Gartner: ![Gartner](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4ODnKDqa.jpg)\r\n\r\nThe evolving business landscape continues to create challenges and opportunities for data and analytics (D&A) leaders — shifting away from focusing solely on tools and technology to decision making as a business competency. 
D&A teams are now in a better position than ever to help lead this change within the organization.\r\n\r\n \r\n\r\nHarnessing the full power of D&A today requires D&A leaders to guide their teams with purpose and scale their scope beyond organizational silos as companies push to transform and accelerate their data-driven strategies. Gartner Data & Analytics Summit 2023 addresses the most significant challenges D&A leaders face while navigating disruption and building the adaptable, innovative organizations this shifting environment demands.\r\n\r\nGo to <u>[dataengineeringpodcast.com/gartnerda](https://www.dataengineeringpodcast.com/gartnerda)</u> Listeners can save $375 off standard rates with code GARTNERDA Promo Code: GartnerDAMaterialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png)\r\n\r\nLooking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. \r\n\r\nMaterialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features.\r\n\r\nSign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.\r\n\r\nGo to <u>[materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)</u>MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png)\r\n\r\nStruggling with broken pipelines? Stale dashboards? Missing data?\r\n\r\nIf this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!\r\n\r\nTrusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!\r\n\r\nVisit <u>[dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo)</u> to learn more.","content_html":"

Summary

Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transforations in a unified SQL interface.","date_published":"2023-01-08T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/47611967-fa3a-4c9c-9a6f-8f246ed484b6.mp3","mime_type":"audio/mpeg","size_in_bytes":24494000,"duration_in_seconds":2645}]},{"id":"03db3a89-0059-4633-beec-46b0bcd8acee","title":"Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI","url":"https://www.dataengineeringpodcast.com/alignai-data-analytics-knowledge-management-episode-356","content_text":"Summary\n\nMaking effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. 
Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nYour host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what AlignAI is and the story behind it?\nWhat are the core problems that you are focused on addressing?\n\n\nWhat are the tactical ways that you are working to solve those problems?\n\nWhat are some of the common and avoidable ways that analytics/AI projects go wrong?\n\n\nWhat are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?\n\nWhat are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?\nCan you describe the design and implementation of the AlignAI platform?\n\n\nHow have the goals and implementation of the product changed since you first started working on it?\n\nWhat is the workflow at the individual and organizational level for businesses that are using AlignAI?\nOne of the perennial challenges with knowledge sharing in an organization is managing incentives to engage with the available material. What are some of the ways that you are working to integrate the creation and distribution of institutional knowledge into employees' day-to-day work?\nWhat are the most interesting, innovative, or unexpected ways that you have seen AlignAI used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on AlignAI?\nWhen is AlignAI the wrong choice?\nWhat do you have planned for the future of AlignAI?\n\n\nContact Info\n\n\nLinkedIn\n@RehganAvon on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nAlignAI\nSharepoint\nConfluence\nGitHub\nCanva\nInstructional Design\nNotion\nCoda\nWaterfall Design\ndbt\n\n\nPodcast Episode\n\nAlteryx\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png)\r\n\r\nStruggling with broken pipelines? Stale dashboards? Missing data?\r\n\r\nIf this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!\r\n\r\nTrusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!\r\n\r\nVisit <u>[dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo)</u> to learn more.Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png)\r\n\r\nHave you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?\r\n\r\nOur friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.\r\n\r\nGo to <u>[dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan)</u> and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. 
If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!","content_html":"

Summary

Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.","date_published":"2022-12-29T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/03db3a89-0059-4633-beec-46b0bcd8acee.mp3","mime_type":"audio/mpeg","size_in_bytes":40859296,"duration_in_seconds":3561}]},{"id":"a9557c3c-91b5-4897-9f70-73129c9742f1","title":"Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams","url":"https://www.dataengineeringpodcast.com/vishal-singh-building-data-products-episode-355","content_text":"Summary\n\nWith all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nBuild Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift., or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.\nYour host is Tobias Macey and today I'm interviewing Vishal Singh about his experience building data products at Starburst\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your definition of a \"data product\" is?\n\n\nWhat are some of the different contexts in which the idea of a data product is applicable?\nHow do the parameters of a data product change across those different contexts/consumers?\n\nWhat are some of the ways that you see the conversation around the purpose and practice of building data products getting overloaded by conflicting objectives?\nWhat do you see as common challenges in data teams around how to approach product thinking in their day-to-day work?\nWhat are some of the tactical ways that product-oriented work on data problems differs from what has become common practice in data teams?\nWhat are some of the features that you are building at Starburst that contribute to the efforts of data teams to build full-featured product experiences for their data?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Starburst used in the context of data products?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working at Starburst?\nWhen is a data product the wrong choice?\nWhat do you have planned for the future of support for data product development at Starburst?\n\n\nContact Info\n\n\nLinkedIn\n@vishal_singh on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nStarburst\n\n\nPodcast Episode\n\nGeophysics\nProduct-Led Growth\nTrino\nDataNova\nStarburst Galaxy\nTableau\nPowerBI\n\n\nPodcast Episode\n\nMetabase\n\n\nPodcast Episode\n\nGreat Expectations\n\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png)\r\nBuild Real-Time Pipelines. Not Endless DAGs!\r\n\r\nCreating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project.\r\n\r\nEvery pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. 
However, real-time pipelines with per-hour or per-minute batches make orchestration intricate and data engineers find themselves burdened with building Direct Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes, task dependencies and maintain temporary data copies.\r\n\r\nOri Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly.\r\n\r\nGo to [dataengineeringpodcast.com/upsolver](dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nDatafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time.\r\n\r\nDatafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at <u>[dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold)</u>\r\n today to book a demo with Datafold.Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!","content_html":"

Summary

With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.","date_published":"2022-12-28T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a9557c3c-91b5-4897-9f70-73129c9742f1.mp3","mime_type":"audio/mpeg","size_in_bytes":33619566,"duration_in_seconds":3525}]},{"id":"524c405b-d562-4a40-bf5a-3c417d5fd3ab","title":"An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch","url":"https://www.dataengineeringpodcast.com/designing-a-lakehouse-from-scratch-episode-354","content_text":"Summary\n\nFive years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. 
Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nYour host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nData platform building journey\n\n\nWhy are you building, who are the users/use cases\nHow to focus on doing what matters over cool tools\nHow to build a good UX\nAnything surprising or did you discover anything you didn't expect at the start\nHow to build so it's modular and can be improved in the future\n\nGeneral build vs buy and vendor selection process\n\n\nObviously have a good BS detector - how can others build theirs\nSo many tools, where do you start - capability need, vendor suite offering, etc.\nAnything surprising in doing much of this at once\nHow do you think about TCO in build versus buy\nAny advice\n\nGuest call out\n\n\nBe brave, believe you are good enough to be on the show\nLook at past episodes and don't pitch the same as what's been on recently\nAnd vendors, be smart, work with your customers to come up with a good pitch for them as guests...\n\n\n\nTobias' advice and learnings from building out a data platform:\n\n\nAdvice: when considering a tool, start from what are you actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, is the capabilty you want to use a unloved feature or a main part of the product. If it's a feature, will they give it the care and attention it needs?\nAdvice: lean heavily on open source. You can fix things yourself and better direct the community's work than just filing a ticket and hoping with a vendor.\nLearning: there is likely going to be some painful pieces missing, especially around metadata, as you build out your platform.\nAdvice: build in a modular way and think of what is my escape hatch? Yes, you have to lock yourself in a bit but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?).\nLearning: be prepared for tools to connect with each other but the connection to not be as robust as you want. Again, be prepared to have metadata challenges especially.\nAdvice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea...\nAdvice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build to general challenges instead of point solutions.\nLearning: it's easy to put data in S3 but it can be painfully difficult to query it. There's a missing piece as to how to store it for easy querying, not just the metadata issues.\nAdvice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot.\nAdvice: look to create paved path / easy path approaches. 
If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but not the data platform team's problem if it isn't working well. \nLearning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay to not have the end platform built at launch, move forward and get something going.\nAdvice: \"one of the perennial problems in technlogy is the bias towards speed and action without necessarily understanding the destination.\" Really consider the path and if you are creating a scalable and maintainable solution instead of pushing for speed to deliver something.\nAdvice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream. \n\n\nTobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nData Mesh Community\n\n\nPodcast\n\nOSI Model\nSchemata\n\n\nPodcast Episode\n\nAtlan\n\n\nPodcast Episode\n\nOpenMetadata\n\n\nPodcast Episode\n\nChris Riccomini\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png)\r\n\r\nStruggling with broken pipelines? Stale dashboards? Missing data?\r\n\r\nIf this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!\r\n\r\nTrusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!\r\n\r\nVisit <u>[dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo)</u> to learn more.Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png)\r\n\r\nHave you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? 
Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?\r\n\r\nOur friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.\r\n\r\nGo to <u>[dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan)</u> and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!","content_html":"

Summary

\n\n

Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Tobias' advice and learnings from building out a data platform:

\n\n\n\n

Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, and dbt for transformations
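As a rough illustration of how those components could fit together (a minimal sketch, not Tobias' actual code), the snippet below defines a single Dagster asset that queries Airbyte-synced data sitting in S3 through a Trino/Starburst Galaxy endpoint backed by the Glue catalog. The hostname, credentials, catalog, schema, and table names are all hypothetical placeholders.

```python
# Hypothetical wiring of the stack described above: Airbyte lands raw events in
# S3, Glue tracks the table metadata, and a Dagster asset queries it via Trino.
import trino
from dagster import Definitions, asset


@asset
def episode_download_counts():
    """Aggregate raw download events (synced to S3 by Airbyte) with a Trino query."""
    conn = trino.dbapi.connect(
        host="example.galaxy.starburst.io",  # placeholder Starburst Galaxy host
        port=443,
        user="pipeline",                     # auth details omitted for brevity
        catalog="glue",                      # placeholder: Glue catalog exposed to Trino
        schema="raw",
        http_scheme="https",
    )
    cursor = conn.cursor()
    cursor.execute(
        "SELECT episode_id, count(*) AS downloads "
        "FROM download_events GROUP BY episode_id"
    )
    return cursor.fetchall()


defs = Definitions(assets=[episode_download_counts])
```

Running `dagster dev` against a module containing this definition would surface the asset in the Dagster UI, where it can be scheduled alongside dbt models and the rest of the platform.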

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.","date_published":"2022-12-25T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/524c405b-d562-4a40-bf5a-3c417d5fd3ab.mp3","mime_type":"audio/mpeg","size_in_bytes":49420648,"duration_in_seconds":4319}]},{"id":"eb7593e7-d8bd-434d-92a6-a21ebf5cf456","title":"Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems","url":"https://www.dataengineeringpodcast.com/opaque-systems-secure-data-analytics-episode-353","content_text":"Summary\n\nEncryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. 
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nBuild Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift., or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.\nYour host is Tobias Macey and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Opaque Systems and the story behind it?\nWhat are the core problems related to security/privacy in data analytics and ML that organizations are struggling with?\n\n\nWhat do you see as the balance of internal vs. 
cross-organization applications for the solutions you are creating?\n\ncomparison with homomorphic encryption\nvalidation and ongoing testing of security/privacy guarantees\nperformance impact of encryption overhead and how to mitigate it\nUX aspects of not being able to view the underlying data\nrisks of information leakage from schema/meta information\nCan you describe how the Opaque Systems platform is implemented?\n\n\nHow have the design and scope of the product changed since you started working on it?\n\nCan you describe a typical workflow for a team or teams building an analytical process or ML project with your platform?\nWhat are some of the constraints in terms of data format/volume/variety that are introduced by working with it in the Opaque platform?\nHow are you approaching the balance of maintaining the MC2 project against the product needs of the Opaque platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the Opaque platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Opaque Systems/MC2?\nWhen is Opaque the wrong choice?\nWhat do you have planned for the future of the Opaque platform?\n\n\nContact Info\n\n\nLinkedIn\nWebsite\n@Podcastinator on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nOpaque Systems\nUC Berkeley RISE Lab\nTLS\nMC²\nHomomorphic Encryption\nSecure Multi-Party Computation\nSecure Enclaves\nDifferential Privacy\nData Obfuscation\nAES == Advanced Encryption Standard\nIntel SGX (Software Guard Extensions)\nIntel TDX (Trust Domain Extensions)\nTPC-H Benchmark\nSpark\nTrino\nPyTorch\nTensorflow\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png)\r\nBuild Real-Time Pipelines. Not Endless DAGs!\r\n\r\nCreating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project.\r\n\r\nEvery pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. 
However, real-time pipelines with per-hour or per-minute batches make orchestration intricate and data engineers find themselves burdened with building Direct Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes, task dependencies and maintain temporary data copies.\r\n\r\nOri Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly.\r\n\r\nGo to [dataengineeringpodcast.com/upsolver](dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nDatafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time.\r\n\r\nDatafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at <u>[dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold)</u>\r\n today to book a demo with Datafold.Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. 
If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!","content_html":"

Summary

\n\n

Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.","date_published":"2022-12-25T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/eb7593e7-d8bd-434d-92a6-a21ebf5cf456.mp3","mime_type":"audio/mpeg","size_in_bytes":47483304,"duration_in_seconds":4105}]},{"id":"caea1d40-d875-4fcb-8379-79c2afc1570f","title":"Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle","url":"https://www.dataengineeringpodcast.com/data-dot-world-data-principles-episode-351","content_text":"Summary\n\nThe data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. 
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nBuild Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift., or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.\nYour host is Tobias Macey and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving re-assessment of the foundational principles of data systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the areas of the data ecosystem that you see the most turmoil and confusion?\nThe past couple of years have brought a lot of attention to the idea of the \"modern data stack\". How has that influenced the ways that your and your customers' teams think about what skills they need to be effective?\nThe other topic that is introducing a lot of confusion and uncertainty is the \"data mesh\". How has that changed the ways that teams think about who is involved in the technical and design conversations around data in an organization?\nNow that we, as an industry, have reached a new generational inflection about how data is generated, processed, and used, what are some of the foundational principles that have proven their worth?\n\n\nWhat are some of the new lessons that are showing the greatest promise?\ndata modeling\ndata platform/infrastructure \ndata collaboration\ndata governance/security/privacy\n\nHow does your work at data.world work support these foundational practices? 
\n\n\nWhat are some of the ways that you work with your teams and customers to help them stay informed on industry practices?\nWhat is your process for understanding the balance between hype and reality as you encounter new ideas/technologies?\n\nWhat are some of the notable changes that have happened in the data.world product and market since I last had Bryon on the show in 2017?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data.world used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data.world?\nWhen is data.world the wrong choice?\nWhat do you have planned for the future of data.world?\n\n\nContact Info\n\n\nJuan\n\n\nLinkedIn\n@juansequeda on Twitter\nWebsite\n\nTim\n\n\nLinkedIn\n@TimGasper on Twitter\nWebsite\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\ndata.world\n\n\nPodcast Episode\n\nGartner Hype Cycle\nData Mesh\nModern Data Stack\nDataOps\nData Observability\nData & AI Landscape\nDataDog\nRDF == Resource Description Framework\nSPARQL\nMoshe Vardi\nStar Schema\nData Vault\n\n\nPodcast Episode\n\nBPMN == Business Process Modeling Notation\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png)\r\nBuild Real-Time Pipelines. Not Endless DAGs!\r\n\r\nCreating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project.\r\n\r\nEvery pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. 
However, real-time pipelines with per-hour or per-minute batches make orchestration intricate and data engineers find themselves burdened with building Direct Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes, task dependencies and maintain temporary data copies.\r\n\r\nOri Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly.\r\n\r\nGo to [dataengineeringpodcast.com/upsolver](dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nDatafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time.\r\n\r\nDatafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at <u>[dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold)</u>\r\n today to book a demo with Datafold.Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. 
If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!","content_html":"

Summary

\n\n

The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.\r\n","date_published":"2022-12-18T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/caea1d40-d875-4fcb-8379-79c2afc1570f.mp3","mime_type":"audio/mpeg","size_in_bytes":47301729,"duration_in_seconds":3929}]},{"id":"6f4853a5-6def-4cca-b26c-4d0aa2af6dfe","title":"Making Sense Of The Technical And Organizational Considerations Of Data Contracts","url":"https://www.dataengineeringpodcast.com/great-expectations-data-contracts-episode-352","content_text":"Summary\n\nOne of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. 
Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nYour host is Tobias Macey and today I'm interviewing Abe Gong about the technical and organizational implementation of data contracts\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your conception of a data contract is?\n\n\nWhat are some of the ways that you have seen them implemented?\n\nHow has your work on Great Expectations influenced your thinking on the strategic and tactical aspects of adopting/implementing data contracts in a given team/organization?\n\n\nWhat does the negotiation process look like for identifying what needs to be included in a contract?\n\nWhat are the interfaces/integration points where data contracts are most useful/necessary?\nWhat are the discussions that need to happen when deciding when/whether a contract \"violation\" is a blocking action vs. issuing a notification?\nAt what level of detail/granularity are contracts most helpful?\nAt the technical level, what does the implementation/integration/deployment of a contract look like?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data contracts used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts/great expectations?\nWhen are data contracts the wrong choice?\nWhat do you have planned for the future of data contracts in great expectations?\n\n\nContact Info\n\n\nLinkedIn\n@AbeGong on Twitter\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\n\nLinks\n\n\nGreat Expectations\n\n\nPodcast Episode\n\nProgressive Typing\nPioneers, Settlers, Town Planners\nPydantic\n\n\nPodcast.__init__ Episode\n\nTypescript\nDuck Typing\nFlyte\n\n\nPodcast Episode\n\nDagster\n\n\nPodcast Episode\n\nTrino\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png)\r\n\r\nStruggling with broken pipelines? Stale dashboards? Missing data?\r\n\r\nIf this resonates with you, you’re not alone. 
Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!\r\n\r\nTrusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!\r\n\r\nVisit <u>[dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo)</u> to learn more.Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png)\r\n\r\nHave you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?\r\n\r\nOur friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.\r\n\r\nGo to <u>[dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan)</u> and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.","content_html":"

Summary

\n\n

One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows.
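As a deliberately simple illustration of the idea (and not Great Expectations' actual API, which the conversation digs into), a data contract can be thought of as an agreed schema plus constraints that every batch is validated against at the hand-off between producer and consumer. All column names and rules below are hypothetical.

```python
# Minimal, hypothetical data contract: a schema agreement checked at the
# boundary between a producing team and its downstream consumers.
from dataclasses import dataclass


@dataclass
class ColumnRule:
    dtype: type
    nullable: bool = False


ORDERS_CONTRACT = {
    "order_id": ColumnRule(int),
    "customer_email": ColumnRule(str),
    "amount_usd": ColumnRule(float),
}


def validate_batch(rows):
    """Return a list of contract violations found in a batch of records."""
    violations = []
    for i, row in enumerate(rows):
        for column, rule in ORDERS_CONTRACT.items():
            value = row.get(column)
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: missing required column '{column}'")
            elif not isinstance(value, rule.dtype):
                violations.append(f"row {i}: '{column}' is not {rule.dtype.__name__}")
    return violations


# Whether a violation blocks the pipeline or only raises a notification is the
# negotiation the episode explores.
print(validate_batch([{"order_id": 1, "customer_email": "a@b.co", "amount_usd": "oops"}]))
```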

\n\n

Announcements

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Closing Announcements

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows.","date_published":"2022-12-18T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6f4853a5-6def-4cca-b26c-4d0aa2af6dfe.mp3","mime_type":"audio/mpeg","size_in_bytes":31636754,"duration_in_seconds":2820}]},{"id":"podlove-2022-12-12t02:05:16+00:00-573a1301a9d1aa2","title":"Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee","url":"https://www.dataengineeringpodcast.com/towhee-embedding-vector-etl-library-episode-350","content_text":"Preamble\nThis is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.\nSummary\nData is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information. In this episode Frank Liu shares how the Towhee library simplifies the work of translating your unstructured data assets (e.g. images, audio, video, etc.) into embeddings that you can use efficiently for machine learning, and how it fits into your workflow for model development.\nAnnouncements\n\nHello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.\nBuilding good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!\nYour host is Tobias Macey and today I’m interviewing Frank Liu about how to use vector embeddings in your ML projects and how Towhee can reduce the effort involved\n\nInterview\n\nIntroduction\nHow did you get involved in machine learning?\nCan you describe what Towhee is and the story behind it?\nWhat is the problem that Towhee is aimed at solving?\nWhat are the elements of generating vector embeddings that pose the greatest challenge or require the most effort?\nOnce you have an embedding, what are some of the ways that it might be used in a machine learning project?\n\nAre there any design considerations that need to be addressed in the form that an embedding takes and how it impacts the resultant model that relies on it? 
(whether for training or inference)\n\n\nCan you describe how the Towhee framework is implemented?\n\nWhat are some of the interesting engineering challenges that needed to be addressed?\nHow have the design/goals/scope of the project shifted since it began?\n\n\nWhat is the workflow for someone using Towhee in the context of an ML project?\nWhat are some of the types optimizations that you have incorporated into Towhee?\n\nWhat are some of the scaling considerations that users need to be aware of as they increase the volume or complexity of data that they are processing?\n\n\nWhat are some of the ways that using Towhee impacts the way a data scientist or ML engineer approach the design development of their model code?\nWhat are the interfaces available for integrating with and extending Towhee?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Towhee used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Towhee?\nWhen is Towhee the wrong choice?\nWhat do you have planned for the future of Towhee?\n\nContact Info\n\nLinkedIn\nfzliu on GitHub\nWebsite\n@frankzliu on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest barrier to adoption of machine learning today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nTowhee\nZilliz\nMilvus\n\nData Engineering Podcast Episode\n\n\nComputer Vision\nTensor\nAutoencoder\nLatent Space\nDiffusion Model\nHSL == Hue, Saturation, Lightness\nWeights and Biases\n\nThe intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0\n\n\n","content_html":"

Preamble

\n

This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.

\n

Summary

\n

Data is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information. In this episode Frank Liu shares how the Towhee library simplifies the work of translating your unstructured data assets (e.g. images, audio, video, etc.) into embeddings that you can use efficiently for machine learning, and how it fits into your workflow for model development.
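For listeners new to the concept, the sketch below shows the general idea of an embedding vector using an off-the-shelf sentence-transformers model rather than Towhee's own pipeline API (which the episode covers): unstructured inputs become fixed-length vectors, and similarity turns into simple vector math. The model name and example strings are assumptions for illustration only.

```python
# Generic illustration of embedding vectors (not Towhee's API): map text to
# fixed-length vectors, then compare items with a vector operation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf model

docs = ["invoice scanned from an email", "receipt photographed on a phone"]
vectors = model.encode(docs, normalize_embeddings=True)  # shape: (2, 384)

# With normalized vectors, cosine similarity reduces to a dot product.
similarity = float(np.dot(vectors[0], vectors[1]))
print(f"cosine similarity: {similarity:.3f}")
```

The same pattern applies to images, audio, and video by swapping in a model for that modality, which is where a dedicated library like Towhee earns its keep.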

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

\n
\n\n

\"\"

","summary":"An interview with Frank Liu about how the open source Towhee library simplifies the work of building pipelines to generate vector embeddings of your data for building machine learning projects.","date_published":"2022-12-11T21:14:43.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/93754b79-ee19-4ea6-852a-ee3a3df3a65e.mp3","mime_type":"audio/mpeg","size_in_bytes":32189563,"duration_in_seconds":3225}]},{"id":"podlove-2022-12-10t20:00:10+00:00-c370aaf1f11d31f","title":"Run Your Applications Worldwide Without Worrying About The Database With Planetscale","url":"https://www.dataengineeringpodcast.com/planetscale-serverless-mysql-episode-349","content_text":"Summary\nOne of the most critical aspects of software projects is managing its data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. Planetscale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fight with differences between development and production. In this episode Nick van Wiggeren explains how the Planetscale platform is implemented, their strategies for balancing maintenance and improvements of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. 
Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nBuild Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift., or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.\nYour host is Tobias Macey and today I’m interviewing Nick van Wiggeren about Planetscale, a serverless and globally distributed MySQL database as a service\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Planetscale is and the story behind it?\nWhat are the core problems that you are solving with the Planetscale platform?\n\nHow might an engineering team address those challenges in the absence of Planetscale/Vitess?\n\n\nCan you describe how Planetscale is implemented?\n\nWhat are some of the addons that you have had to build on top of Vitess to make Planetscale\n\n\nWhat are the impacts that a serverless database has on the way teams approach their application/platform design and development?\nmetrics exposed to help users optimize their usage\nWhat is your policy/philosophy for determining what capabilities to include in Vitess and what belongs in the Planetscale platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Planetscale/Vitess used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Planetscale?\nWhen is Planetscale the wrong choice?\nWhat do you have planned for the future of Planetscale?\n\nContact Info\n\n@nickvanwig on Twitter\nLinkedIn\nnickvanw on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nPlanetscale\nVitess\nCNCF == Cloud Native Computing Foundation\nHadoop\nOLTP == Online Transactional Processing\nGalera\nYugabyte DB\n\nPodcast Episode\n\n\nCitusDB\nMariaDB SkySQL\n\nPodcast Episode\n\n\nCockroachDB\n\nPodcast Episode\n\n\nNewSQL\nAWS PrivateLink\nPlanetscale Connect\nSegment\n\nPodcast Episode\n\n\nBigQuery\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

One of the most critical aspects of software projects is managing its data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. Planetscale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fight with differences between development and production. In this episode Nick van Wiggeren explains how the Planetscale platform is implemented, their strategies for balancing maintenance and improvements of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Nick van Wiggeren about the Planetscale serverless MySQL service built on top of the open source Vitess project and the impact on developer productivity that it offers when you don't have to worry about database operations.","date_published":"2022-12-11T20:57:22.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a138f7aa-e4c5-4d71-8704-87eb08479ae2.mp3","mime_type":"audio/mpeg","size_in_bytes":33111093,"duration_in_seconds":2980}]},{"id":"podlove-2022-12-05t00:31:15+00:00-89dc6680ea3b632","title":"Business Intelligence In The Palm Of Your Hand With Zing Data","url":"https://www.dataengineeringpodcast.com/zing-data-mobile-business-intelligence-episode-348","content_text":"Summary\nBusiness intelligence is the foremost application of data in organizations of all sizes. The typical conception of how it is accessed is through a web or desktop application running on a powerful laptop. Zing Data is building a mobile native platform for business intelligence. This opens the door for busy employees to access and analyze their company information away from their desk, but it has the more powerful effect of bringing first-class support to companies operating in mobile-first economies. In this episode Sabin Thomas shares his experiences building the platform and the interesting ways that it is being used.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. 
You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nYour host is Tobias Macey and today I’m interviewing Sabin Thomas about Zing Data, a mobile-friendly business intelligence platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Zing Data is and the story behind it?\nWhy is mobile access to a business intelligence system important?\n\nWhat does it mean for a business intelligence system to be mobile friendly? (e.g. just looking at charts vs. creating reports, etc.)\n\n\nWhat are the interaction patterns that don’t translate well to mobile from web or desktop BI systems?\n\nWhat are the new interaction patterns that are enabled by the mobile experience?\n\n\nWhat are the capabilities that a native app can provide which would be clunky or impossible as a web app on a mobile device?\nWho are the personas that benefit from a product like Zing Data?\nCan you describe how the platform (backend and app) are implemented?\n\nHow have the design and goals of the system changed/evolved since you started working on it?\n\n\nCan you describe a typical workflow for a team that uses Zing?\n\nIs it typically the sole/primary BI system, or is it more of an augmentation?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Zing used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Zing?\nWhen is Zing the wrong choice?\nWhat do you have planned for the future of Zing Data?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nZing Data\nRakuten\nFlutter\nCordova\nReact Native\nT-SQL\nANSI SQL\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Sabin Thomas about how Zing Data is lets you bring business intelligence with you when you're on the go with first-class support for mobile devices","date_published":"2022-12-04T19:35:04.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bfa091b2-63f1-4286-9263-632827d4d151.mp3","mime_type":"audio/mpeg","size_in_bytes":25578920,"duration_in_seconds":2806}]},{"id":"podlove-2022-12-05t00:22:14+00:00-c99ae49b0ef4491","title":"Adopting Real-Time Data At Organizations Of Every Size","url":"https://www.dataengineeringpodcast.com/materialize-real-time-data-adoption-episode-347","content_text":"Summary\nThe term \"real-time data\" brings with it a combination of excitement, uncertainty, and skepticism. The promise of insights that are always accurate and up to date is appealing to organizations, but the technical realities to make it possible have been complex and expensive. In this episode Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nBuild Data Pipelines. Not DAGs. 
That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift., or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.\nYour host is Tobias Macey and today I’m interviewing Arjun Narayan about the benefits of real-time data for teams of all sizes\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your conception of real-time data is and the benefits that it can provide?\ntypes of organizations/teams who are adopting real-time\nconsumers of real-time data\nlocations in data/application stacks where real-time needs to be integrated\nchallenges (technical/infrastructure/talent) involved in adopting/supporting streaming/real-time\nlessons learned working with early customers that influenced design/implementation of Materialize to simplify adoption of real-time\ntypes of queries that are run on materialize vs. warehouse\nhow real-time changes the way stakeholders think about the data\nsourcing real-time data\nWhat are the most interesting, innovative, or unexpected ways that you have seen real-time data used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Materialize to support real-time data applications?\nWhen is real-time the wrong choice?\nWhat do you have planned for the future of Materialize and real-time data?\n\nContact Info\n\n@narayanarjun on Twitter\nEmail\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nMaterialize\n\nPodcast Episode\n\n\nCockroach Labs\n\nPodcast Episode\n\n\nSQL\nKafka\nDebezium\n\nPodcast Episode\n\n\nChange Data Capture\nReverse ETL\nPulsar\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Arjun Narayan about how to enable organizations of all sizes to take advantage of real-time data, including the technical and organizational investments required to make it happen.","date_published":"2022-12-04T19:30:35.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5e7c331c-515a-4075-810c-fa0bad1d165e.mp3","mime_type":"audio/mpeg","size_in_bytes":33790182,"duration_in_seconds":3024}]},{"id":"podlove-2022-11-28t01:03:34+00:00-a3ae2b85b27e25e","title":"Supporting And Expanding The Arrow Ecosystem For Fast And Efficient Data Processing At Voltron Data","url":"https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346","content_text":"Summary\nThe data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. 
Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow ecosystem\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Voltron Data and the story behind it?\nWhat is the vision for the broader data ecosystem that you are trying to realize through your investment in Arrow and related projects?\n\nHow does your work at Voltron Data contribute to the realization of that vision?\n\n\nWhat is the impact on engineer productivity and compute efficiency that gets introduced by the impedance mismatches between language and framework representations of data?\nThe scope and capabilities of the Arrow project have grown substantially since it was first introduced. Can you give an overview of the current features and extensions to the project?\nWhat are some of the ways that ArrowVe and its related projects can be integrated with or replace the different elements of a data platform?\nCan you describe how Arrow is implemented?\n\nWhat are the most complex/challenging aspects of the engineering needed to support interoperable data interchange between language runtimes?\n\n\nHow are you balancing the desire to move quickly and improve the Arrow protocol and implementations, with the need to wait for other players in the ecosystem (e.g. database engines, compute frameworks, etc.) to add support?\nWith the growing application of data formats such as graphs and vectors, what do you see as the role of Arrow and its ideas in those use cases?\nFor workflows that rely on integrating structured and unstructured data, what are the options for interaction with non-tabular data? (e.g. 
images, documents, etc.)\nWith your support-focused business model, how are you approaching marketing and customer education to make it viable and scalable?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Arrow used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Arrow and its ecosystem?\nWhen is Arrow the wrong choice?\nWhat do you have planned for the future of Arrow?\n\nContact Info\n\nWebsite\nwesm on GitHub\n@wesmckinn on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nVoltron Data\nPandas\n\nPodcast Episode\n\n\nApache Arrow\nPartial Differential Equation\nFPGA == Field-Programmable Gate Array\nGPU == Graphics Processing Unit\nUrsa Labs\nVoltron (cartoon)\nFeature Engineering\nPySpark\nSubstrait\nArrow Flight\nAcero\nArrow Datafusion\nVelox\nIbis\nSIMD == Single Instruction, Multiple Data\nLance\nDuckDB\n\nPodcast Episode\n\n\nData Threads Conference\nNano-Arrow\nArrow ADBC Protocol\nApache Iceberg\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png)\r\n\r\nHave you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?\r\n\r\nOur friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.\r\n\r\nGo to <u>[dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan)</u> and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png)\r\n\r\nStruggling with broken pipelines? Stale dashboards? Missing data?\r\n\r\nIf this resonates with you, you’re not alone. 
Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!\r\n\r\nTrusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!\r\n\r\nVisit <u>[dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo)</u> to learn more.","content_html":"


","summary":"An interview with Wes McKinney about his work at Voltron Data to support and grow the Arrow project and its integration with the broader data ecosystem","date_published":"2022-11-27T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6a022cf0-8873-4fb5-968a-ebf3efca9917.mp3","mime_type":"audio/mpeg","size_in_bytes":34434473,"duration_in_seconds":3025}]},{"id":"podlove-2022-11-28t00:49:46+00:00-890c20dac98da5f","title":"Analyze Massive Data At Interactive Speeds With The Power Of Bitmaps Using FeatureBase","url":"https://www.dataengineeringpodcast.com/featurebase-bitmap-olap-database-episode-345","content_text":"Summary\nThe most expensive part of working with massive data sets is the work of retrieving and processing the files that contain the raw information. FeatureBase (formerly Pilosa) avoids that overhead by converting the data into bitmaps. In this episode Matt Jaffee explains how to model your data as bitmaps and the benefits that this representation provides for fast aggregate computation. He also discusses the improvements that have been incorporated into FeatureBase to simplify integration with the rest of your data stack, and the SQL interface that was added to make working with the product easier.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nBuild Data Pipelines. 
Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift., or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.\nYour host is Tobias Macey and today I’m interviewing Matt Jaffee about FeatureBase (formerly known as Pilosa and Molecula), a real-time analytical database engine built on bitmaps\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what FeatureBase is?\nWhat are the use cases that it is designed and optimized for?\n\nWhat are some applications or analyses that are uniquely suited to FeatureBase’s capabilities?\n\n\nWhat are the notable changes/evolutions that it has gone through in recent years?\n\nWhat are the forces in the broader data ecosystem that have had the greatest impact on your project/product focus?\n\n\nWhat are the data modeling concepts that platform and data engineers need to consider when working with FeatureBase?\n\nWith bitmaps as the core data structure, what is involved in translating existing data into bitmaps?\n\n\nHow does schema evolution translate to the data representation used in FeatureBase?\nHow does the data model influence considerations around security policies and governance?\nWhat are the most interesting, innovative, or unexpected ways that you have seen FeatureBase used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on FeatureBase?\nWhen is FeatureBase the wrong choice?\nWhat do you have planned for the future of FeatureBase?\n\nContact Info\n\nLinkedIn\njaffee on GitHub\n@mattjaffee on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nFeatureBase\n\nPilosa Episode\nMolecula Episode\n\n\nBitmap\nRoaring Bitmaps\nPinecone\n\nPodcast Episode\n\n\nMilvus\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png)\r\n\r\nRudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.\r\n\r\nRudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.\r\n\r\nRudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.\r\n\r\nVisit <u>[dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)</u> to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.","content_html":"


","summary":"An interview with Matt Jaffee about FeatureBase, an open source bitmap database that allows you to query and analyze massive data sets at interactive speeds and the work they have done to simplify integration with the rest of your data platform.","date_published":"2022-11-27T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c1a0a15f-d362-4f1c-8ea5-70da245dac4d.mp3","mime_type":"audio/mpeg","size_in_bytes":39390726,"duration_in_seconds":3564}]},{"id":"podlove-2022-11-21t03:08:25+00:00-9504e3cdb8fd0fb","title":"A Look At The Data Systems Behind The Gameplay For League Of Legends","url":"https://www.dataengineeringpodcast.com/league-of-legends-data-team-episode-344","content_text":"Summary\nThe majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommeder systems that are deployed as part of the game binary. He explains the constraints that he and his team are faced with and the various challenges that they have overcome to build useful data products on top of a legacy platform where they don’t control the end-to-end systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nThe biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. 
Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Ian Schweer about building the data systems that power League of Legends\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what League of Legends is and the role that data plays in the experience?\nWhat are the characteristics of the data that you are working with? (e.g. volume/variety/velocity, structured vs. unstructured, real-time vs. batch, etc.)\n\nWhat are the biggest data-related challenges that you face (technically or organizationally)?\n\n\nMultiplayer games are very sensitive to latency. How does that influence your approach to instrumentation/data collection in the end-user experience?\nCan you describe the current architecture of your data platform?\n\nWhat are the notable evolutions that it has gone through over the life of the game/product?\n\n\nWhat are the capabilities that you are optimizing for in your platform architecture?\nGiven the longevity of the League of Legends product, what are the practices and design elements that you rely on to help onboard new team members?\n\nWhat are the seams that you intentionally build in to allow for evolution of components and use cases?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen data and its derivatives used by Riot Games or your players?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the data stack for League of Legends?\nWhat are the most interesting or informative mistakes that you have made (personally or as a team)?\nWhat do you have planned for the future of the data stack at Riot Games?\n\nContact Info\n\nLinkedIn\nGithub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nRiot Games\nLeague of Legends\nTeam Fight Tactics\nWild Rift\nDoorDash\n\nPodcast Interview\n\n\nDecision Science\nKafka\nAlation\nAirflow\nSpark\nMonte Carlo\n\nPodcast Episode\n\n\nlibtorch\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Hevo: ![Hevo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4VC62YUo.png)\r\n\r\nAre you sick of repetitive, time-consuming ELT work? Step off the hamster wheel and opt for an automated data pipeline like Hevo.\r\n\r\nHevo is a reliable and intuitive data pipeline platform that enables near real-time data movement from 150+ disparate sources to the destination of your choice. Hevo lets you set up pipelines in minutes, and its fault-tolerant architecture ensures no fire-fighting on your end. The pipelines are purpose-built to be ‘set and forget,’ ensuring zero coding or maintenance to keep data flowing 24×7. All it takes is 3 steps for your pipeline to be up and running. Moreover, transparent pricing and 24×7 live tech support ensure 24×7 peace of mind for you.\r\n\r\nDon’t waste another minute on unreliable data pipelines or painstaking manual maintenance. Sprint your way towards near real-time data integration with a pipeline that is easy to set up and even easier to control. Head over to <u>[dataengineeringpodcast.com/hevo](https://www.dataengineeringpodcast.com/hevodata)</u> and sign up for a free 14-day trial that also comes with 24×7 support.\r\n\r\nSelect Star: ![Select Star](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/65NZFtJd.png)\r\n\r\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data.\r\n\r\nFrom analyzing your metadata, query logs, and dashboard activities, Select Star will automatically document your datasets. For every table in Select Star, you can find out where the data originated from, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use.\r\n\r\nWith Select Star’s data catalog, a single source of truth in data is built in minutes, even across thousands of datasets.\r\n\r\nTry it out for free at <u>[dataengineeringpodcast.com/selectstar](https://www.dataengineeringpodcast.com/selectstar)</u> If you’re a data engineering podcast subscriber, we’ll double the length of your free trial and send you a swag package when you continue on a paid plan.","content_html":"


","summary":"An interview with Ian Schweer about the data team behind the League of Legends franchise and how they manage to innovate in the face of legacy systems.","date_published":"2022-11-20T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bd7ae647-5e3b-4d27-ba19-f867f11fe228.mp3","mime_type":"audio/mpeg","size_in_bytes":37411077,"duration_in_seconds":3689}]},{"id":"podlove-2022-11-19t13:38:18+00:00-3f32891303da8fd","title":"Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet","url":"https://www.dataengineeringpodcast.com/sifflet-full-stack-data-obserability-episode-343","content_text":"Summary\nThe problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of \"data entropy\" and how you can tame it before it leads to failures.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. 
With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with sifflet\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Sifflet is and the story behind it?\n\nWhat is the motivating goal for the company and product?\n\n\nWhat are the categories of errors that you consider to be preventable?\n\nHow does the visibility provided by Sifflet contribute to those prevention efforts?\n\n\nWhat are the UI/UX patterns that you rely on to allow for meaningful exploration and analysis of dependency chains/impact assessments in the lineage graph?\nCan you describe how you’ve implemented Sifflet?\n\nHow have the scope and focus of the product evolved from when you first launched?\n\n\nWhat is the workflow for someone getting Sifflet integrated into their data stack?\nWhat are some of the data modeling considerations that need to be considered when pushing metadata to Sifflet?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Sifflet used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Sifflet?\nWhen is Sifflet the wrong choice?\nWhat do you have planned for the future of Sifflet?\n\nContact Info\n\nLinkedIn\n@SalmaBakouk on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSifflet\nData Observability\nDataDog\nNewRelic\nSplunk\nModern Data Stack\nGoCardless\nAirbyte\nFivetran\nORM == Object Relational Mapping\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Ascend: ![Ascend](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/E4SVhJLU.png)\r\n\r\nAscend.io, the Data Automation Cloud, provides the most advanced automation for data and analytics engineering workloads. Ascend.io unifies the core capabilities of data engineering—data ingestion, transformation, delivery, orchestration, and observability—into a single platform so that data teams deliver 10x faster. With 95% of data teams already at or over capacity, engineering productivity is a top priority for enterprises. Ascend’s Flex-code user interface empowers any member of the data team—from data engineers to data scientists to data analysts—to quickly and easily build and deliver on the data and analytics workloads they need. And with Ascend’s DataAware™ intelligence, data teams no longer spend hours carefully orchestrating brittle data workloads and instead rely on advanced automation to optimize the entire data lifecycle. Ascend.io runs natively on data lakes and warehouses and in AWS, Google Cloud and Microsoft Azure.\r\n\r\nGo to <u>[dataengineeringpodcast.com/ascend](https://www.dataengineeringpodcast.com/ascend)</u>\r\n to find out more.","content_html":"


","summary":"An interview with Salma Bakouk about how to use data entropy as a model for identifying and resolving problems in your data platform before they occur and Sifflet's approach to full stack data observability.","date_published":"2022-11-20T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6c6644e0-7ad8-4b60-b28a-cd9882613a94.mp3","mime_type":"audio/mpeg","size_in_bytes":28357795,"duration_in_seconds":2806}]},{"id":"podlove-2022-11-14t02:22:00+00:00-93b56fd0b231beb","title":"Taking A Look Under The Hood At CreditKarma's Data Platform","url":"https://www.dataengineeringpodcast.com/creditkarma-data-platform-episode-341","content_text":"Summary\nCreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. 
With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what CreditKarma is and the role of data in the business?\nWhat is the current team topology that you are using to support data needs in the organization?\n\nHow has that evolved from when you first started with the company?\n\n\nWhat are some of the characteristics of the data that you work with? (e.g. volume/variety/velocity, source of the data, format of the data)\n\nWhat are the aspects of data management and architecture that have posed the greatest challenge?\n\n\nWhat are the data applications that are providing the greatest ROI and/or seeing the most usage?\nHow have you approached the design and growth of your data platform?\nCreditKarma was one of the first FinTech companies to migrate to the cloud, specifically GCP. Why migrate? What were some of your early challenges taking the company to the cloud?\nWhat are the main components of your data platform?\n\nWhat are the most notable evolutions that it has gone through?\n\n\nGiven your strong focus on applications of data science and ML, how has that influenced the architectural foundations of your data capabilities?\nWhat is your process for evaluating build vs. buy decisions?\n\nWhat are your triggers for deciding when to re-evaluate components of your platform?\n\n\nGiven your work with financial institutions how do you address testing and validation of your derived data? How does your team solve for data reliability and quality more broadly?\nWhat are the most interesting, innovative, or unexpected aspects of your growth as a data-led organization?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building up your data platform and teams?\nWhen are the most informative mistakes that you have made?\nWhat do you have planned for the future of your data platform?\n\nContact Info\n\nLinkedIn\n@vishnuvram on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nCreditKarma\nGames 24×7\nVertica\nBigQuery\nGoogle Cloud Dataflow\nAnodot\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Vishnu Venkataraman about his work on the data platform at CreditKarma and how it has evolved over the years that he has been there and their journey to the cloud.","date_published":"2022-11-13T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e8c0e947-e976-484a-8eb3-1dac7cfba522.mp3","mime_type":"audio/mpeg","size_in_bytes":31815972,"duration_in_seconds":3122}]},{"id":"podlove-2022-11-14t03:20:28+00:00-5ee22d6248bc2de","title":"Build Data Products Without A Data Team Using AgileData","url":"https://www.dataengineeringpodcast.com/agiledata-data-platform-service-episode-342","content_text":"Summary\nBuilding data products is an undertaking that has historically required substantial investments of time and talent. With the rise in cloud platforms and self-serve data technologies the barrier of entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. 
Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Shane Gibson about AgileData, a platform that lets you build data products without all of the overhead of managing a data team\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what AgileData is and the story behind it?\nWho is the target audience for this product?\n\nFor organizations that have an existing data team, how does the platform augment/simplify their work?\n\n\nCan you describe how the AgileData platform is implemented?\n\nWhat are some of the notable evolutions that it has gone through since you first started working on it?\nGiven your strong focus on Agile methods in your work, how has that influenced your priorities in developing the platform?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen AgileData used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?\nWhen is AgileData the wrong choice?\nWhat do you have planned for the future of AgileData?\n\nContact Info\n\nLinkedIn\n@shagility on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nAgileData\n\nAgile Practices For Data Interview\n\n\nMicrosoft Azure\nSnowflake\nBigQuery\nDuckDB\n\nPodcast Episode\n\n\nGoogle BI Engine\nOLAP\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building data products is an undertaking that has historically required substantial investments of time and talent. With the rise in cloud platforms and self-serve data technologies the barrier of entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Shane Gibson about his work on the AgileData service and how it encodes agile practices into a self-serve platform which allows organizations to deliver reliable data products without having to hire an entirely new engineering team to support them.","date_published":"2022-11-13T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/826e6e48-3c82-484f-8df3-20353a8f34c3.mp3","mime_type":"audio/mpeg","size_in_bytes":46106809,"duration_in_seconds":4349}]},{"id":"podlove-2022-11-06t23:42:15+00:00-967d12b9c473d60","title":"Clean Up Your Data Using Scalable Entity Resolution And Data Mastering With Zingg","url":"https://www.dataengineeringpodcast.com/zingg-open-source-entity-resolution-episode-339","content_text":"Summary\nDespite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but it has typically required custom development and training machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution framework for data engineers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Zingg is and the story behind it?\nWho is the target audience for Zingg?\n\nHow has that informed your efforts in the development and release of the project?\n\n\nWhat are the use cases where entity resolution is helpful or necessary in a data engineering context?\nWhat are the range of options that are available for teams to implement entity/identity resolution in their data?\n\nWhat was your motivation for creating an open source solution for this use case?\nWhy do you think there has not been a compelling open source and generalized solution previously?\n\n\nCan you describe how Zingg is implemented?\n\nHow have the design and goals shifted since you started working on the project?\n\n\nWhat does the installation and integration process look like for Zingg?\nOnce you have Zingg configured, what is the workflow for a data engineer or analyst?\nWhat are the extension/customization options for someone using Zingg in their environment?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Zingg used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Zingg?\nWhen is Zingg the wrong choice?\nWhat do you have planned for the future of Zingg?\n\nContact Info\n\nLinkedIn\n@sonalgoyal on Twitter\nsonalgoyal on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nZingg\nEntity Resolution\nMDM == Master Data Management\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nSnowpark\nSpark\nMilvus\n\nPodcast Episode\n\n\nPinecone\n\nPodcast Episode\n\n\nDuckDB\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but it has typically required custom development and training machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Sonal Goyal about Zingg, the open source and customizable framework for scalable entity resolution, data mastering, and cleaning without having to start from scratch","date_published":"2022-11-06T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/150984c9-fbe6-4db1-a4ee-ba86d9aea5e4.mp3","mime_type":"audio/mpeg","size_in_bytes":27394636,"duration_in_seconds":2806}]},{"id":"podlove-2022-11-07t00:01:56+00:00-6233ca33a61f502","title":"Build Better Data Products By Creating Data, Not Consuming It","url":"https://www.dataengineeringpodcast.com/snowplow-behavioral-data-creation-episode-340","content_text":"Summary\nA lot of the work that goes into data engineering is trying to make sense of the \"data exhaust\" from other applications and services. There is an undeniable amount of value and utility in that information, but it also introduces significant cost and time requirements. In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. 
For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Nick King about the utility of behavioral data for your data products and the technical and strategic considerations to collect and integrate it\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you share your definition of \"behavioral data\" and how it is differentiated from other sources/types of data?\n\nWhat are some of the unique characteristics of that information?\nWhat technical systems are required to generate and collect those interactions?\n\n\nWhat are the organizational patterns that are required to support effective workflows for building data generation capabilities?\n\nWhat are some of the strategies that have been most effective for bringing together data and application teams to identify and implement what behaviors to track?\n\n\nWhat are some of the ethical and privacy considerations that need to be addressed when working with end-user behavioral data?\nThe data sources associated with business operations services and custom applications already represent some measure of user interaction and behaviors. 
How can teams use the information available from those systems to inform and augment the types of events/information that should be captured/generated in a system like Snowplow?\nCan you describe the workflow for a team using Snowplow to generate data for a given analytical/ML project?\n\nWhat are some of the tactical aspects of deciding what interfaces to use for generating interaction events?\nWhat are some of the event modeling strategies to keep in mind to simplify the analysis and integration of the generated data?\n\n\nWhat are some of the notable changes in implementation and focus for Snowplow over the past ~4 years?\n\nHow has the emergence of the \"modern data stack\" influenced the product direction?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Snowplow used for data generation/behavioral data collection?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Snowplow?\nWhen is Snowplow the wrong choice?\nWhat do you have planned for the future of Snowplow?\n\nContact Info\n\nLinkedIn\n@nking on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSnowplow\n\nPodcast Episode\nPrivate SaaS Episode\n\n\nAS/400\nDB2\nBigQuery\nAzure SQL\nData Robot\nGoogle Spanner\nMRE == Meals Ready to Eat\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

A lot of the work that goes into data engineering is trying to make sense of the "data exhaust" from other applications and services. There is an undeniable amount of value and utility in that information, but it also introduces significant cost and time requirements. In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Nick King about how being deliberate about data creation can produce better and faster results than just consuming whatever data is available","date_published":"2022-11-06T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/82ecee25-56ee-476c-9cee-b3eba72ca89c.mp3","mime_type":"audio/mpeg","size_in_bytes":36177569,"duration_in_seconds":3919}]},{"id":"podlove-2022-10-31t00:42:06+00:00-89268b57f890acf","title":"Expanding The Reach of Business Intelligence Through Ubiquitous Embedded Analytics With Sisense","url":"https://www.dataengineeringpodcast.com/sisense-business-intelligence-embedded-analytics-episode-338","content_text":"Summary\nBusiness intelligence has grown beyond its initial manifestation as dashboards and reports. In its current incarnation it has become a ubiquitous need for analytics and opportunities to answer questions with data. In this episode Amir Orad discusses the Sisense platform and how it facilitates the embedding of analytics and data insights in every aspect of organizational and end-user experiences.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. 
Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Amir Orad about Sisense, a platform focused on providing intelligent analytics everywhere\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Sisense is and the story behind it?\nWhat are the use cases and customers that you are focused on supporting?\nWhat is your view on the role of business intelligence in a data driven organization?\n\nHow has the market shifted in recent years and what are the motivating factors for those changes?\n\n\nMany conversations around data and analytics are focused on self-service access. what are the capabilities that are required to make that a reality?\n\nWhat are the core challenges that teams face on their path to designing and implementing a solution that is comprehensible by their stakeholders?\nWhat is the role of automation vs. low-/no-code?\n\n\nWhat are the unique capabilities that Sisense offers compared to other BI or embedded analytics services?\nCan you describe how the Sisense platform is implemented?\n\nHow have the design and goals changed since you started working on it?\n\n\nWhat is the workflow for someone working with Sisense?\n\nWhat are the options for integrating Sisense with an organization’s data platform?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Sisense used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Sisense?\nWhen is Sisense the wrong choice?\nWhat do you have planned for the future of Sisense?\n\nContact Info\n\nLinkedIn\n@AmirOrad on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSisense\nLooker\n\nPodcast Episode\n\n\nPowerBI\n\nPodcast Episode\n\n\nBusiness Intelligence\nSnowflake\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg)\r\n\r\nYour data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: <u>[dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode)</u> today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!","content_html":"

Summary

Business intelligence has grown beyond its initial manifestation as dashboards and reports. In its current incarnation it has become a ubiquitous need for analytics and opportunities to answer questions with data. In this episode Amir Orad discusses the Sisense platform and how it facilitates the embedding of analytics and data insights in every aspect of organizational and end-user experiences.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"An interview with Amir Orad about how the Sisense platform power embedded analytics experiences and brings the promise of business intelligence beyond the bounds of prefabricated dashboards and reports.","date_published":"2022-10-30T20:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/99116e37-1c13-4575-a960-ee27fbabee0b.mp3","mime_type":"audio/mpeg","size_in_bytes":29913726,"duration_in_seconds":3240}]},{"id":"podlove-2022-10-30t23:41:18+00:00-2f9fd4b0efd3aa8","title":"Analytics Engineering Without The Friction Of Complex Pipeline Development With Optimus and dbt","url":"https://www.dataengineeringpodcast.com/optimus-dbt-analytics-engineering-sisense-episode-337","content_text":"Summary\nOne of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. 
Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Nandam Karthik about his experiences building analytics projects with dbt and Optimus for his clients at Sigmoid.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Sigmoid is and the types of projects that you are involved in?\n\nWhat are some of the core challenges that your clients are facing when they start working with you?\n\n\nAn ELT workflow with dbt as the transformation utility has become a popular pattern for building analytics systems. Can you share some examples of projects that you have built with this approach?\n\nWhat are some of the ways that this pattern becomes bespoke as you start exploring a project more deeply?\n\n\nWhat are the sharp edges/white spaces that you encountered across those projects?\nCan you describe what Optimus is?\n\nHow does Optimus improve the user experience of teams working in dbt?\n\n\nWhat are some of the tactical/organizational practices that you have found most helpful when building with dbt and Optimus?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Optimus/dbt used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt/Optimus projects?\nWhen is Optimus/dbt the wrong choice?\nWhat are your predictions for how \"best practices\" for analytics projects will change/evolve in the near/medium term?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSigmoid\nOptimus\ndbt\n\nPodcast Episode\n\n\nAirflow\nAWS Glue\nBigQuery\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png)\r\n\r\nDatafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time.\r\n\r\nDatafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at <u>[dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold)</u>\r\n today to book a demo with Datafold.","content_html":"

Summary

One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"An interview with Nandam Karthik about his experiences at Sisense combining Optimus and dbt to deliver analytics projects without the overhead of complex pipeline development so that analysts can own the end-to-end workflow.","date_published":"2022-10-30T19:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/496ef2f6-507a-4968-9d96-d1e3183fce11.mp3","mime_type":"audio/mpeg","size_in_bytes":22130747,"duration_in_seconds":2409}]},{"id":"podlove-2022-10-22t17:09:07+00:00-f9b269ba44db1d8","title":"How To Bring Agile Practices To Your Data Projects","url":"https://www.dataengineeringpodcast.com/agile-practices-for-data-projects-episode-336","content_text":"Summary\nAgile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. 
For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what AgileData is and the story behind it?\nWhat are the main industries and/or use cases that you are focused on supporting?\nThe data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don’t map well to data engineering/analysis?\nOne of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?\n\nHow do you design in affordances for refactoring of the data models without breaking downstream assets?\n\n\nAnother aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products?\nWhat are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products?\nIn order to have a useful target to work toward it’s necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what’s hard vs. 
easy, etc.)\n\nHow do you work with the \"customers\" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?\n\n\nWhat are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets?\nWhat are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?\nWhen is agile the wrong choice for a data project?\nWhat do you have planned for the future of AgileData?\n\nContact Info\n\nLinkedIn\n@shagility on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nAgileData\nOptimalBI\nHow To Make Toast\nData Mesh\nInformation Product Canvas\nDataKitchen\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\nSoda Data\n\nPodcast Episode\n\n\nGoogle DataStore\nUnfix.work\nActivity Schema\n\nPodcast Episode\n\n\nData Vault\n\nPodcast Episode\n\n\nStar Schema\nLean Methodology\nScrum\nKanban\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png)\r\n\r\nHave you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?\r\n\r\nOur friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.\r\n\r\nGo to <u>[dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan)</u> and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.Prefect: ![Prefect](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/BZGGl8wE.png)\r\n\r\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. 
Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it.\r\nTrusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit <u>[dataengineeringpodcast.com/prefect](https://www.dataengineeringpodcast.com/prefect)</u>.","content_html":"

Summary

Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

","summary":"An interview with Shane Gibson about how to apply agile development practices to your data projects while avoiding overwhelming technical debt","date_published":"2022-10-23T19:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a31da5ba-1345-4937-b47e-83f70a5c857f.mp3","mime_type":"audio/mpeg","size_in_bytes":57980171,"duration_in_seconds":4337}]},{"id":"podlove-2022-10-22t17:04:10+00:00-338ed4711f6ee41","title":"Going From Transactional To Analytical And Self-managed To Cloud On One Database With MariaDB","url":"https://www.dataengineeringpodcast.com/mariadb-operational-and-analytical-database-episode-335","content_text":"Summary\nThe database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products being built on top of the open source foundation address the growing needs for advanced storage and analytical capabilities.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nYou wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free \"In Data We Trust World Tour\" t-shirt.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Manjot Singh about MariaDB, one of the leading open source database engines\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what MariaDB is and the story behind it?\nMariaDB started as a fork of the MySQL engine, what are the notable differences that have evolved between the two projects?\n\nHow have the MariaDB team worked to maintain compatibility for users who want to switch from MySQL?\n\n\nWhat are the unique capabilities that MariaDB offers?\nBeyond the core open source project you have built a suite of commercial extensions. What are the use cases/capabilities that you are targeting with those products?\nHow do you balance the time and effort invested in the open source engine against the commercial projects to ensure that the overall effort is sustainable?\n\nWhat are your guidelines for what features and capabilities are released in the community edition and which are more suited to the commercial products?\n\n\nFor your managed cloud service, what are the differentiating factors for that versus the database services provided by the major cloud platforms?\n\nWhat do you see as the future of the database market and how we interact and integrate with them?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen MariaDB used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on MariaDB?\nWhen is MariaDB the wrong choice?\nWhat do you have planned for the future of MariaDB?\n\nContact Info\n\nLinkedIn\n@ManjotSingh on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nMariaDB\nHTML Goodies\nMySQL\nPHP\nMySQL/MariaDB Pluggable Storage\nInnoDB\nMyISAM\nAria Storage\nSQL/PSM\nMyRocks\nMariaDB XPand\nBSL == Business Source License\nPaxos\nMariaDB MongoDB Compatibility\nVertica\nMariaDB Spider Storage Engine\nIHME == Institute for Health Metrics and Evaluation\nRundeck\nMaxScale\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Manjot Singh, field CTO of MariaDB, about their eponymous open source database and how they are continuing to evolve and innovate.","date_published":"2022-10-23T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5bcdbe15-d98b-4d9a-9a6c-f69b421e09b3.mp3","mime_type":"audio/mpeg","size_in_bytes":36922515,"duration_in_seconds":3124}]},{"id":"podlove-2022-10-16t23:15:19+00:00-1f5b914695fe6d5","title":"Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks","url":"https://www.dataengineeringpodcast.com/pathway-database-that-thinks-episode-334","content_text":"Summary\nLogistics and supply chains are under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics. In this episode Adrian Kosowski explains how the Pathway product got started, how its design simplifies the creation of data products that support supply chain operations, and how developers can help to build an ecosystem of applications that allow businesses to accelerate their time to insight.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. 
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Adrian Kosowski about Pathway, an AI powered database and streaming framework. Pathway is used for analyzing and optimizing supply chains and logistics in real-time.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Pathway is and the story behind it?\nWhat are the primary challenges that you are working to solve?\n\nWho are the target users of the Pathway product and how does it fit into their work?\n\n\nYour tagline is that Pathway is \"the database that thinks\". What are some of the ways that existing database and stream-processing architectures introduce friction on the path to analysis?\n\nHow does Pathway incorporate computational capabilities into its engine to address those challenges?\n\n\nWhat are the types of data that Pathway is designed to work with?\nCan you describe how the Pathway engine is implemented?\n\nWhat are some of the ways that the design and goals of the product have shifted since you started working on it?\n\n\nWhat are some of the ways that Pathway can be integrated into an analytical system?\nWhat is involved in adapting its capabilities to different industries?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Pathway used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Pathway?\nWhen is Pathway the wrong choice?\nWhat do you have planned for the future of Pathway?\n\nContact Info\n\nAdrian Kosowski\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nPathway\nPathway for developers\nSPOJ.com – competitive programming community\nSpatiotemporal Data\nPointers in programming\nClustering\nThe Halting Problem\nPytorch\n\nPodcast.__init__ Episode\n\n\nTensorflow\nMarkov Chains\nNetworkX\nFinite State Machine\nDTW == Dynamic Time Warping\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Adrian Kosowski about Pathway, the database that thinks, and how it is designed to perform real time analysis on data that powers logistics and supply chains so businesses can survive in the modern economy","date_published":"2022-10-16T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/89f784d7-7010-4f78-8c04-4acdfd7630bd.mp3","mime_type":"audio/mpeg","size_in_bytes":53634543,"duration_in_seconds":3756}]},{"id":"podlove-2022-10-16t20:41:47+00:00-2c5ae1a5412df10","title":"An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem","url":"https://www.dataengineeringpodcast.com/dremio-open-data-lakehouse-episode-333","content_text":"Summary\nThe \"data lakehouse\" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be \"open\" and describes the different components that the Dremio team build and contribute to.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nYou wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free \"In Data We Trust World Tour\" t-shirt.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. 
According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Dremio is and the story behind it?\nWhat are some of the notable changes in the Dremio product and related ecosystem over the past ~4 years?\n\nHow has the advent of the lakehouse paradigm influenced the product direction?\n\n\nWhat are the main benefits that a lakehouse design offers to a data platform?\nWhat are some of the architectural patterns that are only possible with a lakehouse?\nWhat is the distinction you make between a lakehouse and an open lakehouse?\nWhat are some of the unique features that Dremio offers for lakehouse implementations?\nWhat are some of the investments that Dremio has made to the broader open source/open lakehouse ecosystem?\n\nHow are those projects/investments being used in the commercial offering?\n\n\nWhat is the purchase/usage model that customers expect for lakehouse implementations?\n\nHow have those expectations shifted since the first iterations of Dremio?\n\n\nDremio has its ancestry in the Drill project. How has that history influenced the capabilities (e.g. integrations, scalability, deployment models, etc.) and evolution of Dremio compared to systems like Trino/Presto and Spark SQL?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Dremio used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Dremio?\nWhen is Dremio the wrong choice?\nWhat do you have planned for the future of Dremio?\n\nContact Info\n\nEmail\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nDremio\n\nPodcast Episode\n\n\nDremio Sonar\nDremio Arctic\nDML == Data Modification Language\nSpark\nData Lake\nTrino\nPresto\nDremio Data Reflections\nTableau\nDelta Lake\n\nPodcast Episode\n\n\nApache Impala\nApache Arrow\nDuckDB\n\nPodcast Episode\n\n\nGoogle BigLake\nProject Nessie\nApache Iceberg\n\nPodcast Episode\n\n\nHive Metastore\nAWS Glue Catalog\nDremel\nApache Drill\nArrow Gandiva\ndbt\nAirbyte\n\nPodcast Episode\n\n\nSinger\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Jason Hughes about the Dremio product suite and their various contributions to the open data lakehouse ecosystem.","date_published":"2022-10-16T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/74c6f5c4-a20b-4e51-bcab-e108f44845d3.mp3","mime_type":"audio/mpeg","size_in_bytes":35351539,"duration_in_seconds":3044}]},{"id":"podlove-2022-10-10t00:59:01+00:00-a1f3b73d7787021","title":"Making The Open Data Lakehouse Affordable Without The Overhead At Iomete","url":"https://www.dataengineeringpodcast.com/iomete-open-data-lakehouse-service-episode-332","content_text":"Summary\nThe core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use. In order to make the data lakehouse available to a wider audience the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can worry about answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. 
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Iomete is and the story behind it?\nThe selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?\nThe principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?\n\nWhat are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?\nWhat are some of the shortcomings of lakehouse architectures?\n\n\nWhat are the fundamental capabilities that are necessary to run a fully functional lakehouse?\nCan you describe how the Iomete platform is implemented?\n\nWhat was your process for deciding which elements to adopt off the shelf vs. building from scratch?\nWhat do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?\n\n\nWhat are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?\nWhat have been the most challenging aspects of building a competitive business in such an active product category?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Iomete used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?\nWhen is Iomete the wrong choice?\nWhat do you have planned for the future of Iomete?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nIomete\nFivetran\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nDatabricks\nCollibra\n\nPodcast Episode\n\n\nTalend\nParquet\nTrino\nSpark\nPresto\nSnowpark\nIceberg\n\nPodcast Episode\n\n\nIomete dbt adapter\nSinger\nMeltano\n\nPodcast Episode\n\n\nAWS Interface Gateway\nApache Hudi\n\nPodcast Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nAWS EMR\nAWS Athena\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Vusal Dadalov about the Iomete platform and how they are building a managed data lakehouse using open technologies and formats without the overhead of running it yourself or paying more than if you hosted it yourself.","date_published":"2022-10-09T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/54c123df-ba87-4729-80e0-06c070663002.mp3","mime_type":"audio/mpeg","size_in_bytes":42507746,"duration_in_seconds":3324}]},{"id":"podlove-2022-10-10t00:51:04+00:00-851b893e62baa71","title":"Investing In Understanding The Customer Journey At American Express","url":"https://www.dataengineeringpodcast.com/american-express-customer-360-episode-331","content_text":"Summary\nFor any business that wants to stay in operation, the most important thing they can do is understand their customers. American Express has invested substantial time and effort in their Customer 360 product to achieve that understanding. In this episode Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility and the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nYou wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free \"In Data We Trust World Tour\" t-shirt.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Purvi Shah about building the Customer 360 data product for American Express and migrating their enterprise data platform to the cloud\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the Customer 360 project is and the story behind it?\nWhat are the types of questions and insights that the C360 project is designed to answer?\n\nCan you describe the types of information and data sources that you are relying on to feed this project?\n\n\nWhat are the different axes of scale that you have had to address in the design and architecture of the C360 project? (e.g. 
geographical, volume/variety/velocity of data, scale of end-user access and data manipulation, etc.)\nWhat are some of the challenges that you have had to address in order to build and maintain the map between organizational and technical requirements/semantics in the platform?\n\nWhat were some of the early wins that you targeted, and how did the lessons from those successes drive the product design going forward?\n\n\nCan you describe the platform architecture for your data systems that are powering the C360 product?\n\nHow have the design/goals/requirements of the system changed since you first started working on it?\n\n\nHow have you approached the integration and migration of legacy data systems and assets into this new platform?\n\nWhat are some of the ongoing maintenance challenges that the legacy platforms introduce?\n\n\nCan you describe how you have approached the question of data quality/observability and the validation/verification of the generated assets?\nWhat are the aspects of governance and access control that you need to deal with being part of a financial institution?\nNow that the C360 product has been in use for a few years, what are the strategic and tactical aspects of the ongoing evolution and maintenance of the product which you have had to address?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the C360 product used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on C360 for American Express?\nWhen is a C360 project the wrong choice?\nWhat do you have planned for the future of C360 and enterprise data platforms at American Express?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nData Stewards\nHadoop\nSBA Paycheck Protection\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Purvi Shah about the Customer 360 project at American Express and their journey into the cloud for enterprise data management","date_published":"2022-10-09T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f3ae3ef3-1185-4a74-95a6-55ecd91cdcef.mp3","mime_type":"audio/mpeg","size_in_bytes":30471415,"duration_in_seconds":2443}]},{"id":"podlove-2022-10-03t01:34:01+00:00-1160637970ae86f","title":"Gain Visibility And Insight Into Your Supply Chains Through Operational Analytics Powered By Roambee","url":"https://www.dataengineeringpodcast.com/roambee-operational-analytics-logistics-episode-330","content_text":"Summary\nThe global economy is dependent on complex and dynamic networks of supply chains powered by sophisticated logistics. This requires a significant amount of data to track shipments and operational characteristics of materials and goods. Roambee is a platform that collects, integrates, and analyzes all of that information to provide companies with the critical insights that businesses need to stay running, especially in a time of such constant change. In this episode Roambee CEO, Sanjay Sharma, shares the types of questions that companies are asking about their logistics, the technical work that they do to provide ways to answer those questions, and how they approach the challenge of data quality in its many forms.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. 
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Sanjay Sharma about how Roambee is using data to bring visibility into shipping and supply chains.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Roambee is and the story behind it?\nWho are the personas that are looking to Roambee for insights?\nWhat are some of the questions that they are asking about the state of their assets?\nCan you describe the types of information sources and the format of the data that you are working with?\nWhat are the types of SLAs that you are focused on delivering to your customers? (e.g. latency from recorded event to analytics, accuracy, etc.)\nCan you describe how the Roambee platform is implemented?\n\nHow have the evolving landscape of sensor and data technologies influenced the evolution of your service?\n\n\nGiven your support for customer-created integrations and user-generated inputs on shipment updates, how do you manage data quality and consistency?\nHow do you approach customer onboarding, and what is your approach to reducing the time to value?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the Roambee platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Roambee?\nWhen is Roambee the wrong choice?\nWhat do you have planned for the future of Roambee?\n\nContact Info\n\nLinkedIn\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nRoambee\nRFID == Radio Frequency Identification\nEDI == Electronic Data Interchange\nDigital Twin\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Roambee CEO, Sanjay Sharma, about the implicit complexities of global supply chains and logistics, and how they are integrating multiple sources of information to power operational analytics and improve efficiency","date_published":"2022-10-02T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7767b894-a5c4-45c1-acbd-e7e78dc36fef.mp3","mime_type":"audio/mpeg","size_in_bytes":41032199,"duration_in_seconds":3603}]},{"id":"podlove-2022-10-03t01:11:20+00:00-3fdbebb1a1f4272","title":"Make Data Lineage A Ubiquitous Part Of Your Work By Simplifying Its Implementation With Alvin","url":"https://www.dataengineeringpodcast.com/alvin-data-lineage-service-episode-329","content_text":"Summary\nData lineage is something that has grown from a convenient feature to a critical need as data systems have grown in scale, complexity, and centrality to business. Alvin is a platform that aims to provide a low effort solution for data lineage capabilities focused on simplifying the work of data engineers. In this episode co-founder Martin Sahlen explains the impact that easy access to lineage information can have on the work of data engineers and analysts, and how he and his team have designed their platform to offer that information to engineers and stakeholders in the places that they interact with data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nYou wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free \"In Data We Trust World Tour\" t-shirt.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Martin Sahlen about his work on data lineage at Alvin and how it factors into the day-to-day work of data engineers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Alvin is and the story behind it?\nWhat is the core problem that you are trying to solve at Alvin?\nData lineage has quickly become an overloaded term. What are the elements of lineage that you are focused on addressing?\n\nWhat are some of the other sources/pieces of information that you integrate into the lineage graph?\n\n\nHow does data lineage show up in the work of data engineers?\n\nIn what ways does your focus on data engineers inform the way that you model the lineage information?\n\n\nAs with every data asset/product, the lineage graph is only as useful as the data that it stores. What are some of the ways that you focus on establishing and ensuring a complete view of lineage?\n\nHow do you account for assets (e.g. tables, dashboards, exports, etc.) that are created outside of the \"officially supported\" methods? (e.g. someone manually runs a SQL create statement, etc.)\n\n\nCan you describe how you have implemented the Alvin platform?\n\nHow have the design and goals shifted from when you first started exploring the problem?\n\n\nWhat are the types of data systems/assets that you are focused on supporting? (e.g. data warehouses vs. lakes, structured vs. 
unstructured, which BI tools, etc.)\nHow does Alvin fit into the workflow of data engineers and their downstream customers/collaborators?\n\nWhat are some of the design choices (both visual and functional) that you focused on to avoid friction in the data engineer’s workflow?\n\n\nWhat are some of the open questions/areas for investigation/improvement in the space of data lineage?\n\nWhat are the factors that contribute to the difficulty of a truly holistic and complete view of lineage across an organization?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Alvin used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Alvin?\nWhen is Alvin the wrong choice?\nWhat do you have planned for the future of Alvin?\n\nContact Info\n\nLinkedIn\n@martinsahlen on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nAlvin\nUnacast\nsqlparse Python library\nCython\n\nPodcast.__init__ Episode\n\n\nAntlr\nKotlin programming language\nPostgreSQL\n\nPodcast Episode\n\n\nOpenSearch\nElasticSearch\nRedis\nKubernetes\nAirflow\nBigQuery\nSpark\nLooker\nMode\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Martin Sahlen about how the Alvin platform simplifies the work of collecting and analyzing data lineage so that it can be used more effectively by data engineers","date_published":"2022-10-02T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/920b4136-cca5-4c90-9eab-c75329f8b766.mp3","mime_type":"audio/mpeg","size_in_bytes":48756800,"duration_in_seconds":3376}]},{"id":"podlove-2022-09-25t21:12:39+00:00-23ceb7ef313b256","title":"Power Your Real-Time Analytics Without The Headache Using Fivetran's Change Data Capture Integrations","url":"https://www.dataengineeringpodcast.com/fivetran-change-data-capture-episode-327","content_text":"Summary\nData integration from source systems to their downstream destinations is the foundational step for any data product. With the increasing expecation for information to be instantly accessible, it drives the need for reliable change data capture. The team at Fivetran have recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product, discusses the nuances of different approaches to change data capture from various sources.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nYou wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free \"In Data We Trust World Tour\" t-shirt.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Mark Van de Wiel about Fivetran’s implementation of change data capture and the state of streaming data integration in the modern data stack\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are some of the notable changes/advancements at Fivetran in the last 3 years?\n\nHow has the scale and scope of usage for real-time data changed in that time?\n\n\nWhat are some of the differences in usage for real-time CDC data vs. event streams that have been the driving force for a large amount of real-time data?\nWhat are some of the architectural shifts that are necessary in an organizations data platform to take advantage of CDC data streams?\n\nWhat are some of the shifts in e.g. cloud data warehouses that have happened/are happening to allow for ingestion and timely processing of these data feeds?\n\n\nWhat are some of the different ways that CDC is implemented in different source systems?\n\nWhat are some of the ways that CDC principles might start to bleed into e.g. APIs/SaaS systems to allow for more unified processing patterns across data sources?\n\n\nWhat are some of the architectural/design changes that you have had to make to provide CDC for your customers at Fivetran?\nWhat are the most interesting, innovative, or unexpected ways that you have seen CDC used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on CDC at Fivetran?\nWhen is CDC the wrong choice?\nWhat do you have planned for the future of CDC at Fivetran?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nFivetran\n\nPodcast Episode\n\n\nHVR Software\nChange Data Capture\nDebezium\n\nPodcast Episode\n\n\nLogMiner\nMaterialize\n\nPodcast Episode\n\n\nKafka\nKinesis\ndbt\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation that information be instantly accessible drives the need for reliable change data capture. The team at Fivetran has recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture across various sources.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Mark Van de Wiel about the challenges of building reliable change data capture integrations and the work that Fivetran is doing to let you drive real-time analytics without the headache","date_published":"2022-09-25T21:20:03.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b9aabd47-1bf2-4484-b921-77ae080c3eef.mp3","mime_type":"audio/mpeg","size_in_bytes":38614317,"duration_in_seconds":2976}]},{"id":"podlove-2022-09-26t01:18:48+00:00-d94e4fdf37f8c3c","title":"Build A Common Understanding Of Your Data Reliability Rules With Soda Core and Soda Checks Language","url":"https://www.dataengineeringpodcast.com/sodacl-data-reliability-engineering-episode-328","content_text":"Summary\nRegardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that question. To help support the efforts of data teams the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. 
For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Tom Baeyens about Soda Data’s new DSL for data reliability\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what SodaCL is and the story behind it?\n\nWhat is the scope of functionality that SodaCL is intended to address?\n\n\nWhat are the ways that reliability is measured for data assets? (what is the equivalent to site uptime?)\nWhat are the core abstractions that you identified for simplifying the declaration of data validations?\nHow did you approach the design of the SodaCL syntax to balance flexibility for various use cases, with structure and opinionated application?\n\nWhy YAML?\n\n\nCan you describe how the Soda Core utility is implemented?\n\nHow have the design and scope of the SodaCL dialect and the Soda Core framework evolved since you started working on them?\n\n\nWhat are the available integration/extension points for teams who are using Soda Core?\nCan you describe how SodaCL integrates into the workflow of data and analytics engineers?\nWhat is your process for evolving the SodaCL dialect in a maintainable and sustainable manner?\nWhat are the most interesting, innovative, or unexpected ways that you have seen SodaCL used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SodaCL?\nWhen is SodaCL the wrong choice?\nWhat do you have planned for the future of SodaCL?\n\nContact Info\n\nLinkedIn\n@tombaeyens on Twitter\ntombaeyens on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSoda Data\n\nPodcast Episode\n\n\nSoda Checks Language\nGreat Expectations\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Regardless of how data is being used, it is critical that the information can be trusted. The practice of data reliability engineering has recently gained momentum to address that need. To support the efforts of data teams, the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Tom Baeyens about the Soda Checks Language and how it was designed to express the various concerns involved in data reliability engineering in a format that is approachable by everyone.","date_published":"2022-09-25T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ca9eebf0-6db6-4394-b83d-1785aeeaa675.mp3","mime_type":"audio/mpeg","size_in_bytes":36666011,"duration_in_seconds":2461}]},{"id":"podlove-2022-09-19t01:23:16+00:00-179d7a2a5b6532e","title":"Operational Analytics To Increase Efficiency For Multi-Location Businesses With OpsAnalitica","url":"https://www.dataengineeringpodcast.com/opsanalitica-multi-location-business-operational-analytics-episode-325","content_text":"Summary\nIn order to improve efficiency in any business you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations it becomes even more challenging and important to gain insights into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi location businesses and their operational practices. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties and how business owners can use the provided analytics to support their staff in their duties.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. 
The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYou wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free \"In Data We Trust World Tour\" t-shirt.\nYour host is Tobias Macey and today I’m interviewing Tommy Yionoulis about using data to improve efficiencies in multi-location service businesses with OpsAnalitica\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what OpsAnalitica is and the story behind it?\nWhat are some examples of the types of questions that business owners and site managers need to answer in order to run their operations?\n\nWhat are the sources of information that are needed to be able to answer these questions?\nIn the absence of a platform like OpsAnalitica, how are business operations getting the answers to these questions?\n\n\nWhat are some of the sources of inefficiency that they are contending with?\n\nHow do those inefficiencies compound as you scale the number of locations?\n\n\nCan you describe how the OpsAnalitica system is implemented?\n\nHow have the design and goals of the platform evolved since you started working on it?\n\n\nCan you describe the workflow for a business using OpsAnalitica?\nWhat are some of the biggest integration challenges that you have to address?\nWhat are some of the design elements that you have invested in to reduce errors and complexity for employees tracking relevant metrics?\nWhat are the most interesting, innovative, or unexpected ways that you have seen OpsAnalitica used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on OpsAnalitica?\nWhen is OpsAnalitica the wrong choice?\nWhat do you have planned for the future of OpsAnalitica?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nOpsAnalitica\nQuiznos\nFormRouter\nCooper Atkins(?)\nSensorThings API\nThe Founder movie\nToast\nLooker\n\nPodcast Episode\n\n\nPower BI\n\nPodcast Episode\n\n\nPareto Principle\nDecisions workflow platform\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

In order to improve efficiency in any business you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations it becomes even more challenging, and more important, to gain insight into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how they led him to found OpsAnalitica, a platform for collecting and analyzing operational metrics for multi-location businesses. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties, and how business owners can use the resulting analytics to support their staff in their work.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"In this episode Tommy Yionoulis talks about how incorporating deliberate data collection into business processes can drive important operational insights in multi-location businesses and his work at OpsAnalitica to make it manageable.","date_published":"2022-09-18T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c6727eef-2dfd-4d49-b5f1-e3e727e324ce.mp3","mime_type":"audio/mpeg","size_in_bytes":59805176,"duration_in_seconds":5523}]},{"id":"podlove-2022-09-19t01:36:48+00:00-cbff5e5617134e8","title":"Building A Shared Understanding Of Data Assets In A Business Through A Single Pane Of Glass With Workstream","url":"https://www.dataengineeringpodcast.com/workstream-data-asset-collaboration-episode-326","content_text":"Summary\nThere is a constant tension in business data between growing siloes, and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. In order to help distribute critical context about data assets and their status into the locations where work is being done Nicholas Freund co-founded Workstream. In this episode he discusses the challenge of maintaining shared visibility and understanding of data work across the various stakeholders and his efforts to make it a seamless experience.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. 
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nData engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.\nYour host is Tobias Macey and today I’m interviewing Nicholas Freund about Workstream, a platform aimed at providing a single pane of glass for analytics in your organization\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Workstream is and the story behind it?\nWhat is the core problem that you are trying to solve at Workstream?\n\nHow does that problem manifest for the different stakeholders in an organization?\n\n\nWhat are the contributing factors that lead to fragmentation of visibility for data workflows at different stages?\n\nWhat are the sources of information that you use to build a cohesive view of an organization’s data assets?\n\n\nWhat are the lifecycle stages of a data asset that are most often overlooked or un-maintained?\n\nWhat are the risks and challenges associated with retirement of a data asset?\n\n\nCan you describe how Workstream is implemented?\n\nHow have the design and goals of the system changed since you first started it?\n\n\nWhat does the day-to-day interaction with workstream look like for different roles in a company?\nWhat are the long-range impacts on team behaviors/productivity/capacity that you hope to catalyze?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Workstream used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Workstream?\nWhen is Workstream the wrong choice?\nWhat do you have planned for the future of Workstream?\n\nContact Info\n\nLinkedIn\n@nickfreund on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nWorkstream\nData Catalog\nEntropy\nCDP == Customer Data Platform\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

There is a constant tension in business data between growing silos and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. To help distribute critical context about data assets and their status into the locations where work is being done, Nicholas Freund co-founded Workstream. In this episode he discusses the challenge of maintaining shared visibility and understanding of data work across the various stakeholders, and his efforts to make that a seamless experience.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Nicholas Freund about his efforts at Workstream to build a single view of data assets and their status across the organization in a context that is understandable by everyone.","date_published":"2022-09-18T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/465c62f8-6904-4263-a861-455c2a44fbc1.mp3","mime_type":"audio/mpeg","size_in_bytes":41018119,"duration_in_seconds":3291}]},{"id":"podlove-2022-09-11t12:03:17+00:00-4b4495760eea8a3","title":"Building Data Pipelines That Run From Source To Analysis And Activation With Hevo Data","url":"https://www.dataengineeringpodcast.com/hevo-data-pipeline-platform-episode-323","content_text":"Summary\nAny business that wants to understand their operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. In order to reduce the amount of time and effort required to build pipelines that power critical insights Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service and how his frustrations as a product owner have informed his work at Hevo Data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. 
Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Manish Jethani about Hevo Data’s experiences navigating the modern data stack and the role of ELT in data workflows\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Hevo Data is and the story behind it?\nWhat is the core problem that you are trying to solve with the Hevo platform?\n\nWhat are the target personas of who will bring Hevo into a company and who will be using/interacting with it for their day-to-day?\n\n\nWhat are some of the lessons that you learned building a product that relied on data to function which you have carried into your work at Hevo, providing the utilities that enable other businesses and products?\nThere are numerous commercial and open source options for collecting, transforming, and integrating data. 
What are the differentiating features of Hevo?\n\nWhat are your views on the benefits of a vertically integrated platform for data flows in the world of the disaggregated \"modern data stack\"?\n\n\nCan you describe how the Hevo platform is implemented?\n\nWhat are some of the optimizations that you have invested in to support the aggregate load from your customers?\n\n\nThe predominant pattern in recent years for collecting and processing data is ELT. In your work at Hevo, what are some of the nuance and exceptions to that \"best practice\" that you have encountered?\n\nHow have you factored those learnings back into the product?\n\n\nmechanics of schema mapping\n\nedge cases that require human intervention\n\nhow to surface those in a timely fashion\n\n\n\n\nWhat is the process for onboarding onto the Hevo platform?\n\nOnce an organization has adopted Hevo, can you describe the workflow of building/maintaining/evolving data pipelines?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Hevo used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Hevo?\nWhen is Hevo the wrong choice?\nWhat do you have planned for the future of Hevo?\n\nContact Info\n\nLinkedIn\n@ManishJethani on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nHevo Data\nKafka\nMongoDB\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Sifflet: ![Sifflet](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/z-fy2Hbs.png)\r\n\r\nSifflet is a Full Data Stack Observability platform acting as an overseeing layer to the Data Stack, ensuring that data is reliable from ingestion to consumption. Whether the data is in transit or at rest, Sifflet is able to detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack.\r\n\r\nIn addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. We also offer a 2-week free trial.\r\n\r\nGo to <u>[dataengineeringpodcast.com/sifflet](https://www.dataengineeringpodcast.com/sifflet)</u> to find out more.","content_html":"

Summary

\n

Any business that wants to understand its operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. To reduce the amount of time and effort required to build pipelines that power critical insights, Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service, and how his frustrations as a product owner have informed his work at Hevo Data.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

Sponsored By:

","summary":"An interview with Manish Jethani about the Hevo Data platform for building end-to-end data pipelines that automate flows from source systems, into the warehouse, and out to operational platforms without all of the maintenance overhead.","date_published":"2022-09-11T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6f8b0ab5-9da7-4fa9-b359-e720f2a7e2d2.mp3","mime_type":"audio/mpeg","size_in_bytes":39636055,"duration_in_seconds":3435}]},{"id":"podlove-2022-09-11t12:11:15+00:00-b3919434f0754c0","title":"Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata","url":"https://www.dataengineeringpodcast.com/schemata-schema-compatibility-utility-episode-324","content_text":"Summary\nData engineering systems are complex and interconnected with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. In order to turn this into a tractable problem one approach is to define and enforce contracts between producers and consumers of data. Ananth Packildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\n\n\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\n\n\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\n\n\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. 
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\n\n\nYour host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.\n\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Schemata is and the story behind it?\n\nHow does the garbage in/garbage out problem manifest in data warehouse/data lake environments?\n\n\nWhat are the different places in a data system that schema definitions need to be established?\n\nWhat are the different ways that schema management gets complicated across those various points of interaction?\n\n\nCan you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?\n\nHow does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?\n\n\nHow is the Schemata utility implemented?\n\nWhat are some of the design and scope questions that you had to work through while developing Schemata?\n\n\nWhat is the broad vision that you have for Schemata and its impact on data practices?\nHow are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?\nThe core of the utility is the generation of structured messages How are those messages propagated, stored, and analyzed?\nWhat are the pieces of Schemata and its usage that are still undefined?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Schemata used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?\nWhen is Schemata the wrong choice?\nWhat do you have planned for the future of Schemata?\n\nContact Info\n\nananthdurai on GitHub\n@ananthdurai on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSchemata\nData Engineering Weekly\nZendesk\nRalph Kimball\nData Warehouse Toolkit\nIteratively\n\nPodcast Episode\n\n\nProtocol Buffers (protobuf)\nApplication Tracing\nOpenTelemetry\nDjango\nSpring Framework\nDependency Injection\nJSON Schema\ndbt\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. One approach to making this a tractable problem is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing dependency chains to be constructed and evolved iteratively while integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Ananth Packildurai about the Schemata project and how it provides visibility into the connections and compatibility of schemas that flow from source systems through all of your transformations and into your data assets.","date_published":"2022-09-11T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4ad45e2b-2324-4574-97cf-f0680f742648.mp3","mime_type":"audio/mpeg","size_in_bytes":45049685,"duration_in_seconds":3579}]},{"id":"podlove-2022-09-04t17:40:43+00:00-e2613a9b6044680","title":"A Reflection On Data Observability As It Reaches Broader Adoption","url":"https://www.dataengineeringpodcast.com/monte-carlo-data-observability-industry-adoption-episode-321","content_text":"Summary\nData observability is a product category that has seen massive growth and adoption in recent years. Monte Carlo is in the vanguard of companies who have been enabling data teams to observe and understand their complex data systems. In this episode founders Barr Moses and Lior Gavish rejoin the show to reflect on the evolution and adoption of data observability technologies and the capabilities that are being introduced as the broader ecosystem adopts the practices.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe only thing worse than having bad data is not knowing that you have it. 
With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about the state of the market for data observability and their own work at Monte Carlo\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you give the elevator pitch for Monte Carlo?\n\nWhat are the notable changes in the Monte Carlo product and business since our last conversation in October 2020?\n\n\nYou were one of the early entrants in the market of data quality/data observability products. In your work to gain visibility and traction you invested substantially in content creation (blog posts, presentations, round table conversations, etc.). How would you summarize the focus of your initial efforts?\nWhy do you think data observability has really taken off? A few years ago, the category barely existed – what’s changed?\nThere’s a larger debate within the data engineering community regarding whether it makes sense to go deep or go broad when it comes to monitoring your data. In other words, do you start with a few important data sets, or do you attempt to cover the entire ecosystem. What is your take?\nFor engineers and teams who are just now investigating and investing in observability/quality automation for their data, what are their motivations?\nHow has the conversation around the value/motivating factors matured or changed over the past couple of years?\n\nIn what way have the requirements and capabilities of data observability platforms shifted?\n\nWhat are the forces in the ecosystem that have driven those changes?\n\n\nHow has the scope and vision for your work at Monte Carlo evolved as the understanding and impact of data quality have become more widespread?\n\n\nWhen teams invest in data quality/observability what are some of the ways that the insights gained influence their other priorities and design choices? (e.g. platform design, pipeline design, data usage, etc.)\n\nWhen it comes to selecting what parts of the data stack to invest in, how do data leaders prioritize? For instance, when does it make sense to build or buy a data catalog? A data observability platform?\n\n\nThe adoption of any tool that adds constraints is a delicate balance. What have you found to be the predominant patterns for teams who are incorporating Monte Carlo? (e.g. maintaining delivery velocity and adding safety/trust)\nA corollary to the goal of data engineers for higher reliability and visibility is the need by the business/team leadership to identify \"return on investment\". 
How do you and your customers think about the useful metrics and measurement goals to justify the time spent on \"non-functional\" requirements?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Monte Carlo used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Monte Carlo?\nWhen is Monte Carlo the wrong choice?\nWhat do you have planned for the future of Monte Carlo?\n\nContact Info\n\nBarr\n\nLinkedIn\n@BM_DataDowntime on Twitter\n\n\nLior\n\nLinkedIn\n@lgavish on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nMonte Carlo\n\nPodcast Episode\n\n\nApp Dynamics\nDatadog\nNew Relic\nData Quality Fundamentals book\nState Of Data Quality Survey\ndbt\n\nPodcast Episode\n\n\nAirflow\nDagster\n\nPodcast Episode\n\n\nEpisode: Incident Management For Data Teams\nDatabricks Delta\nPatch.tech Snowflake APIs\nHightouch\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"","date_published":"2022-09-04T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/092768f7-e363-4842-961a-903928d88010.mp3","mime_type":"audio/mpeg","size_in_bytes":46726935,"duration_in_seconds":3519}]},{"id":"podlove-2022-09-05t01:08:24+00:00-8639e43e7919dc2","title":"Introduce Climate Analytics Into Your Data Platform Without The Heavy Lifting Using Sust Global","url":"https://www.dataengineeringpodcast.com/sust-global-climate-analytics-episode-322","content_text":"Summary\nThe global climate impacts everyone, and the rate of change introduces many questions that businesses need to consider. Getting answers to those questions is challenging, because the climate is a multidimensional and constantly evolving system. Sust Global was created to provide curated data sets for organizations to be able to analyze climate information in the context of their business needs. In this episode Gopal Erinjippurath discusses the data engineering challenges of building and serving those data sets, and how they are distilling complex climate information into consumable facts so you don’t have to be an expert to understand it.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nThe biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. 
Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Gopal Erinjippurath about his work at Sust Global building data sets from geospatial and satellite information to power climate analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Sust Global is and the story behind it?\n\nWhat audience(s) are you focused on?\n\n\nClimate change is obviously a huge topic in the zeitgeist and has been growing in importance. What are the data sources that you are working with to derive climate information?\n\nWhat role do you view Sust Global having in addressing climage change?\n\n\nHow are organizations using your climate information assets to inform their analytics and business operations?\n\nWhat are the types of questions that they are asking about the role of climate (present and future) for their business activities?\nHow can they use the climate information that you provide to understand their impact on the planet?\n\n\nWhat are some of the educational efforts that you need to undertake to ensure that your end-users understand the context and appropriate semantics of the data that you are providing? (e.g. concepts around climate science, statistically meaningful interpretations of aggregations, etc.)\nCan you describe how you have architected the Sust Global platform?\n\nWhat are some examples of the types of data workflows and transformations that are necessary to maintain your customer-facing services?\n\n\nHow have you approached the question of modeling for the data that you provide to end-users to make it straightforward to integrate and analyze the information?\n\nWhat is your process for determining relevant granularities of data and normalizing scales? (e.g. time and distance)\n\n\nWhat is involved in integrating with the Sust Global platform and how does it fit into the workflow of data engineers/analysts/data scientists at your customer organizations?\nAny analytical task is an exercise in story-telling. 
What are some of the techniques that you and your customers have found useful to make climate data relatable and understandable?\n\nWhat are some of the challenges involved in mapping between micro and macro level insights and translating them effectively for the consumer?\n\n\nHow does the increasing sensor capabilities and scale of coverage manifest in your data?\n\nHow do you account for increasing coverage when analyzing across longer historical time scales?\n\n\nHow do you balance the need to build a sustainable business with the importance of access to the information that you are working with?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Sust Global used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Sust Global?\nWhen is Sust the wrong choice?\nWhat do you have planned for the future of Sust Global?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSust Global\nPlanet Labs\nCarbon Capture\nIPCC\nData Lodge(?)\n6th Assessment Report\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Gopal Erinjippurath about Sust Global's work to bring climate analytics into your data platform through robust APIs and curated data sets.","date_published":"2022-09-04T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ebbc1f65-845c-4e37-aa37-a4104633d3f8.mp3","mime_type":"audio/mpeg","size_in_bytes":41090042,"duration_in_seconds":3258}]},{"id":"podlove-2022-08-29t01:37:32+00:00-1fdd04b08edff6f","title":"An Exploration Of What Data Automation Can Provide To Data Engineers And Ascend's Journey To Make It A Reality","url":"https://www.dataengineeringpodcast.com/ascend-data-automation-episode-320","content_text":"Summary\nThe dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. 
With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Sean Knapp about the role of data automation in building maintainable systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you mean by the term \"data automation\" and the assumptions that it includes?\nOne of the perennial challenges of automation is that there are always steps that are resistant to being performed without human involvement. What are some of the tasks that you have found to be common problems in that sense?\nWhat are the different concerns that need to be included in a stack that supports fully automated data workflows?\nThere was recently an interesting article suggesting that the \"left-to-right\" approach to data workflows is backwards. In your experience, what would be required to allow for triggering data processes based on the needs of the data consumers? (e.g. \"make sure that this BI dashboard is up to date every 6 hours\")\nWhat are the tasks that are most complex to build automation for?\nWhat are some companies or tools/platforms that you consider to be exemplars of \"data automation done right\"?\n\nWhat are the common themes/patterns that they build from?\n\n\nHow have you approached the need for data automation in the implementation of the Ascend product?\nHow have the requirements for data automation changed as data plays a more prominent role in a growing number of businesses?\n\nWhat are the foundational elements that are unchanging?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen data automation implemented?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data automation at Ascend?\nWhat are some of the ways that data automation can go wrong?\nWhat are you keeping an eye on across the data ecosystem?\n\nContact Info\n\n@seanknapp on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nAscend\n\nPodcast Episode\n\n\nGoogle Sawzall\nCI/CD\nAirflow\nKubernetes\nAscend FlexCode\nMongoDB\nSHA == Secure Hash Algorithm\ndbt\n\nPodcast Episode\n\n\nMaterialized View\nGreat Expectations\n\nPodcast Episode\n\n\nMonte Carlo\n\nPodcast Episode\n\n\nOpenLineage\n\nPodcast Episode\n\n\nOpen Metadata\n\nPodcast Episode\n\n\nEgeria\nOOM == Out Of Memory Manager\nFive Whys\nData Mesh\nData Fabric\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Bigeye: ![Bigeye](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/qaHgbHoq.png)\r\n\r\nBigeye is an industry-leading data observability platform that gives data engineering and science teams the tools they need to ensure their data is always fresh, accurate and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye’s automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business.\r\n\r\nGo to <u>[dataengineeringpodcast.com/bigeye](https://www.dataengineeringpodcast.com/bigeye)</u> today and start trusting your data.\r\n","content_html":"

","summary":"An interview with Sean Knapp about the potential impact of data automation and the various considerations and capabilities that are required to make it a reality.","date_published":"2022-08-28T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c01699a9-2e46-4a16-afab-467ede5e242e.mp3","mime_type":"audio/mpeg","size_in_bytes":44153211,"duration_in_seconds":3812}]},{"id":"podlove-2022-08-28t20:50:23+00:00-7f604b2c69396ee","title":"Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations","url":"https://www.dataengineeringpodcast.com/airbnb-alumni-data-driven-organization-episode-319","content_text":"Summary\nAirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several almuni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nThe biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. 
Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Lindsay Pettingill Chetan Sharma, Swaroop Jagadish, Maxime Beauchemin, and Nick Handel about the lessons that they learned in their time at AirBnB and how they are carrying that forward to their respective companies\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nYou all worked at AirBnB in similar time frames and then went on to found data-focused companies that are finding success in their respective categories. Do you consider it an outgrowth of the specific company culture/work involved or a curiosity of the moment in time for the data industry that led you each in that direction?\nWhat are the elements of AirBnB’s data culture that you feel were done right?\n\nWhat do you see as the critical decisions/inflection points in the company’s growth that led you down that path?\n\n\nEvery journey has its detours and dead-ends. What are the mistakes that were made (individual and collective) that were most instructive for you?\nWhat about that experience and other experiences led you each to go our respective directions with data startups?\n\nWas your motivation to start a company addressing the work that you did at AirBnB due to the desire to build on existing success, or the need to fix a nagging frustration?\n\n\nWhat are the critical lessons for data teams that you are focused on teaching to engineers inside and outside your company?\n\nWhat are your predictions for the next 5 years of data?\n\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on translating your experiences at AirBnB into successful products?\n\nContact Info\n\nLindsay\n\nLinkedIn\n@lpettingill on Twitter\n\n\nChetan\n\nLinkedIn\n@chesharma87 on Twitter\n\n\nMaxime\n\nmistercrunch on GitHub\nLinkedIn\n@mistercrunch on Twitter\n\n\nSwaroop\n\nswaroopjagadish on GitHub\nLinkedIn\n@arudis on Twitter\n\n\nNick\n\nLinkedIn\n@NicholasHandel on Twitter\nnhandel on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. 
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nIggy\nEppo\n\nPodcast Episode\n\n\nAcryl\n\nPodcast Episode\n\n\nDataHub\nPreset\nSuperset\n\nPodcast Episode\n\n\nAirflow\nTransform\n\nPodcast Episode\n\n\nDeutsche Bank\nUbisoft\nBlackRock\nKafka\nPinot\nStata\nR\nKnowledge-Repo\n\nPodcast.__init__ Episode\n\n\nAirBnB Almond Flour Cookie Recipe\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with alumni of AirBnB's formative years as a data driven organization about the lessons that they learned there and how they are carrying them forward in the founding of new data companies.","date_published":"2022-08-28T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0f01e12e-d0fe-4d2e-9f6a-6835f28ccf21.mp3","mime_type":"audio/mpeg","size_in_bytes":54878900,"duration_in_seconds":4214}]},{"id":"podlove-2022-08-22t00:50:18+00:00-dc2e45fe9d171db","title":"Understanding The Role Of The Chief Data Officer","url":"https://www.dataengineeringpodcast.com/tracy-daniels-chief-data-officer-episode-318","content_text":"Summary\nThe position of Chief Data Officer (CDO) is relatively new in the business world and has not been universally adopted. As a result, not everyone understands what the responsibilities of the role are, when you need one, and how to hire for it. In this episode Tracy Daniels, CDO of Truist, shares her journey into the position, her responsibilities, and her relationship to the data professionals in her organization.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. 
With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Tracy Daniels about the role and responsibilities of the Chief Data Officer and how it is evolving along with the ecosystem\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your path to CDO of Truist has been?\n\nAs a CDO, what are your responsibilities and scope of influence?\n\n\nNot every organization has an explicit position for the CDO. What are the factors that determine when that should be a distinct role?\n\nWhat is the relationship and potential overlap with a CTO?\n\n\nAs the CDO of Truist, what are some of the projects/activities that are vying for your time and attention?\nCan you share the composition of your teams and how you think about organizational structure and integration for data professionals in your company?\nWhat are the industry and business trends that are having the greatest impact on your work as a CDO?\n\nHow has your role evolved over the past few years?\n\n\nWhat are some of the organizational politics/pressures that you have had to navigate to achieve your objectives?\n\nWhat are some of the ways that priorities at the C-level can be at cross purposes to that of the CDO?\n\n\nWhat are some of the skills and experiences that you have found most useful in your work as CDO?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the CDO position/responsibilities addressed in other organizations?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working as a CDO?\nWhen is a distinct CDO position the wrong choice for an organization?\nWhat advice do you have for anyone who is interested in charting a career path to the CDO seat?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTruist\nChief Data Officer\nChief Analytics Officer\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Tracy Daniels, CDO of Truist, about the role and responsibilities of the Chief Data Officer and when your organization might need one","date_published":"2022-08-21T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7d8bbbd0-0a69-4f30-98bb-b263c8b6703c.mp3","mime_type":"audio/mpeg","size_in_bytes":38627116,"duration_in_seconds":2830}]},{"id":"podlove-2022-08-22t00:27:03+00:00-7b00be681c51f0f","title":"An Exploration Of The Expectations, Ecosystem, and Realities Of Real-Time Data Applications","url":"https://www.dataengineeringpodcast.com/rockset-real-time-data-applications-episode-317","content_text":"Summary\nData has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset is doing to make it easier for engineers to build those experiences.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nThe biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. 
Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Shruti Bhat about the growth of real-time data applications and the systems required to support them\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what is driving the adoption of real-time analytics?\narchitectural patterns for real-time analytics\nsources of latency in the path from data creation to end-user\nend-user/customer expectations for time to insight\n\ndiffering expectations between internal and external consumers\n\n\nscales of data that are reasonable for real-time vs. batch\nWhat are the most interesting, innovative, or unexpected ways that you have seen real-time architectures implemented?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Rockset?\nWhen is Rockset the wrong choice?\nWhat do you have planned for the future of Rockset?\n\nContact Info\n\nLinkedIn\n@shrutibhat on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nRockset\n\nPodcast Episode\n\n\nEmbedded Analytics\nConfluent\nKafka\nAWS Kinesis\nLambda Architecture\nData Observability\nData Mesh\nDynamoDB Streams\nMongoDB Change Streams\nBigeye\nMonte Carlo Data\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Shruti Bhat about the state of the ecosystem for real-time data applications and the motivating factors for when and how to build them.","date_published":"2022-08-21T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b45cd1bc-c8bf-4e52-8fe6-5b94f6883c29.mp3","mime_type":"audio/mpeg","size_in_bytes":47514451,"duration_in_seconds":3979}]},{"id":"podlove-2022-08-14t01:17:07+00:00-3566b28f83bf591","title":"Bringing Automation To Data Labeling For Machine Learning With Watchful","url":"https://www.dataengineeringpodcast.com/watchful-data-labeling-automation-episode-316","content_text":"Summary\nData engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform to make labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved in the process.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nThe biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. 
Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Shayan Mohanty about Watchful, a data-centric platform for labeling your machine learning inputs\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Watchful is and the story behind it?\nWhat are your core goals at Watchful?\n\nWhat problem are you solving and who are the people most impacted by that problem?\n\n\nWhat is the role of the data engineer in the process of getting data labeled for machine learning projects?\nData labeling is a large and competitive market. How do you characterize the different approaches offered by the various platforms and services?\nWhat are the main points of friction involved in getting data labeled?\n\nHow do the types of data and its applications factor into how those challenges manifest?\nWhat does Watchful provide that allows it to address those obstacles?\n\n\nCan you describe how Watchful is implemented?\n\nWhat are some of the initial ideas/assumptions that you have had to re-evaluate?\nWhat are some of the ways that you have had to adjust the design of your user experience flows since you first started?\n\n\nWhat is the workflow for teams who are adopting Watchful?\n\nWhat are the types of collaboration that need to happen in the data labeling process?\nWhat are some of the elements of shared vocabulary that different stakeholders in the process need to establish to be successful?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Watchful used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Watchful?\nWhen is Watchful the wrong choice?\nWhat do you have planned for the future of Watchful?\n\nContact Info\n\nLinkedIn\n@shayanjm on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nWatchful\nEntity Resolution\nSupervised Machine Learning\nBERT\nCLIP\nLabelBox\nLabel Studio\nSnorkel AI\n\nMachine Learning Podcast Episode\n\n\nRegEx == Regular Expression\nREPL == Read Evaluate Print Loop\nIDE == Integrated Development Environment\nTuring Completeness\nClojure\nRust\nNamed Entity Recognition\nThe Halting Problem\nNP Hard\nLidar\nShayan: Arguments Against Hand Labeling\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Shayan Mohanty about the challenges of building repeatable data labeling processes and how Watchful is building a platform to let domain experts codify their knowledge for automated labeling of training data for machine learning projects.","date_published":"2022-08-13T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/df6cb4c8-4532-4d4c-a13d-0f0ffb8ae54d.mp3","mime_type":"audio/mpeg","size_in_bytes":63601154,"duration_in_seconds":4829}]},{"id":"podlove-2022-08-13t19:51:35+00:00-4140386e4e11191","title":"Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery","url":"https://www.dataengineeringpodcast.com/selectstar-context-driven-data-discovery-episode-315","content_text":"Summary\nData is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nThe biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. 
Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you share your definition of \"data discovery\" and the technical/social/process components that are required to make it viable?\n\nWhat are the differences between \"data discovery\" and the capabilities of a \"data catalog\" and how do they overlap?\n\n\ndiscovery of assets outside the bounds of the warehouse\ncapturing and codifying tribal knowledge\ncreating a useful structure/framework for capturing data context and operationalizing it\nWhat are the most interesting, innovative, or unexpected ways that you have seen data discovery implemented?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data discovery at SelectStar?\nWhen might a data discovery effort be more work than is required?\nWhat do you have planned for the future of SelectStar?\n\nContact Info\n\nLinkedIn\n@shinjikim on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nSelect Star\n\nPodcast Episode\n\n\nFivetran\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nTableau\nPowerBI\n\nPodcast Episode\n\n\nLooker\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Shinji Kim about the challenges of collecting contextual metadata for your information assets and how to organize it to power effective data discovery for everyone in the business","date_published":"2022-08-13T21:15:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/80a23705-f430-41f6-8c27-8b7b6ab0e314.mp3","mime_type":"audio/mpeg","size_in_bytes":43075678,"duration_in_seconds":3204}]},{"id":"podlove-2022-08-06t13:50:11+00:00-187677c91f31a15","title":"Useful Lessons And Repeatable Patterns Learned From Data Mesh Implementations At AgileLab","url":"https://www.dataengineeringpodcast.com/agilelab-data-mesh-boost-episode-314","content_text":"Summary\nData mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab have first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data mesh is the right choice for you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. 
For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Paolo Platter about Agile Lab’s lessons learned through helping large enterprises establish their own data mesh\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you share your experiences working with data mesh implementations?\nWhat were the stated goals of project engagements that led to data mesh implementations?\nWhat are some examples of projects where you explored data mesh as an option and decided that it was a poor fit?\nWhat are some of the technical and process investments that are necessary to support a mesh strategy?\nWhen implementing a data mesh what are some of the common concerns/requirements for building and supporting data products?\n\nWhat are the general shape that a product will take in a mesh environment?\nWhat are the features that are necessary for a product to be an effective component in the mesh?\n\n\nWhat are some of the aspects of a data product that are unique to a given implementation?\nYou built a platform for implementing data meshes. Can you describe the technical elements of that system?\n\nWhat were the primary goals that you were addressing when you decided to invest in building Data Mesh Boost?\n\n\nHow does Data Mesh Boost help in the implementation of a data mesh?\nCode review is a common practice in construction and maintenance of software systems. How does that activity map to data systems/products?\nWhat are some of the challenges that you have encountered around CI/CD for data products?\n\nWhat are the persistent pain points involved in supporting pre-production validation of changes to data products?\n\n\nBeyond the initial work of building and deploying a data product there is the ongoing lifecycle management. How do you approach refactoring old data products to match updated practices/templates?\nWhat are some of the indicators that tell you when an organization is at a level of sophistication that can support a data mesh approach?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Data Mesh Boost used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Mesh Boost?\nWhen is Data Mesh (Boost) the wrong choice?\nWhat do you have planned for the future of Data Mesh Boost?\n\nContact Info\n\nLinkedIn\n@axlpado on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nAgileLab\nSpark\nCloudera\nZhamak Dehghani\nData Mesh\nData Fabric\nData Virtualization\nq-lang\nData Mesh Boost\nData Mesh Marketplace\nSourceGraph\nOpenMetadata\nEgeria\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab have first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data mesh is the right choice for you.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Paolo Platter about the experience that he and his team at AgileLab have had implementing Data Mesh strategies at multiple organizations and the repeatable patterns that they have built into their Data Mesh Boost product.","date_published":"2022-08-06T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2ac41684-00f3-425b-a50c-4278af0837fb.mp3","mime_type":"audio/mpeg","size_in_bytes":36282345,"duration_in_seconds":2910}]},{"id":"podlove-2022-08-06t13:30:07+00:00-c3dc42a40970617","title":"Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus","url":"https://www.dataengineeringpodcast.com/milvus-open-source-vector-database-episode-313","content_text":"Summary\nThe optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions so that there is no additional processing required to go from input data to inference output. In this episode Frank Liu explains how the open source Milvus vector database is implemented to speed up machine learning development cycles, how to think about proper storage and scaling of these vectors, and how data engineering and machine learning teams can collaborate on the creation and maintenance of these data sets.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. 
Find out more at dataengineeringpodcast.com/sifflet today!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Frank Liu about the open source vector database Milvus and how it simplifies the work of supporting ML teams\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Milvus is and the story behind it?\nWhat are the goals of the project?\n\nWho is the target audience for this database?\n\n\nWhat are the use cases for a vector database and similarity search of vector embeddings?\n\nWhat are some of the unique capabilities that this category of database engine introduces?\n\n\nCan you describe how Milvus is architected?\n\nWhat are the primary system requirements that have influenced the design choices?\nHow have the goals and implementation evolved since you started working on it?\n\n\nWhat are some of the interesting details that you have had to address in the storage layer to allow for fast and efficient retrieval of vector embeddings?\nWhat are the limitations that you have had to impose on size or dimensionality of vectors to allow for a consistent user experience in a running system?\n\nThe reference material states that similarity between two vectors implies similarity in the source data. What are some of the characteristics of vector embeddings that might make them immune or susceptible to confusion of similarity across different source data types that share some implicit relationship due to specifics of their vectorized representation? (e.g. an image vs. an audio file, etc.)\n\n\nWhat are the available deployment models/targets and how does that influence potential use cases?\nWhat is the workflow for someone who is building an application on top of Milvus?\nWhat are some of the data management considerations that are introduced by vector databases? (e.g. 
manage versions of vectors, metadata management, etc.)\nWhat are the most interesting, innovative, or unexpected ways that you have seen Milvus used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Milvus?\nWhen is Milvus the wrong choice?\nWhat do you have planned for the future of Milvus?\n\nContact Info\n\nLinkedIn\nfzliu on GitHub\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nMilvus\nZilliz\nLinux Foundation/AI & Data\nMySQL\nPostgreSQL\nCockroachDB\nPilosa\n\nPodcast Episode\n\n\nPinecone Vector DB\n\nPodcast Episode\n\n\nVector Embedding\nReverse Image Search\nVector Arithmetic\nVector Distance\nSIGMOD\nTensor\nRotation Matrix\nL2 Distance\nCosine Distance\nOpenAI CLIP\nKnowhere\nKafka\nPulsar\n\nPodcast Episode\n\n\nCAP Theorem\nMilvus Helm Chart\nZilliz Cloud\nMinIO\nTowhee\nAttu\nFeder\nFPGA == Field Programmable Gate Array\nTPU == Tensor Processing Unit\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions so that there is no additional processing required to go from input data to inference output. In this episode Frank Liu explains how the open source Milvus vector database is implemented to speed up machine learning development cycles, how to think about proper storage and scaling of these vectors, and how data engineering and machine learning teams can collaborate on the creation and maintenance of these data sets.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Frank Liu about the open source vector database Milvus and how its native storage of vector embeddings reduces the friction involved in building and deploying machine learning models.","date_published":"2022-08-06T09:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/803fd633-5173-44b6-b47a-199ba3e9a791.mp3","mime_type":"audio/mpeg","size_in_bytes":51008337,"duration_in_seconds":3531}]},{"id":"podlove-2022-07-31t20:21:43+00:00-617ea4c4c6ef5f6","title":"Interactive Exploratory Data Analysis On Petabyte Scale Data Sets With Arkouda","url":"https://www.dataengineeringpodcast.com/arkouda-big-data-exploratory-data-analysis-episode-311","content_text":"Summary\nExploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nData stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. 
Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Arkouda is and the story behind it?\nWhat are the main goals of the project?\n\nHow does it address those goals?\nWho is the primary audience for Arkouda?\n\n\nWhat are some of the main points of friction that engineers and scientists encounter while conducting exploratory data analysis (EDA)?\n\nWhat kinds of behaviors are they engaging in during these exploration cycles?\n\n\nWhen data scientists run up against the limitations of their tools and environments how does that impact the work of data engineers/data platform owners?\nThere have been a number of libraries/frameworks/utilities/etc. built to improve the experience and outcomes for EDA. 
What was missing that made Arkouda necessary/useful?\nCan you describe how Arkouda is implemented?\n\nWhat are some of the novel algorithms that you have had to design to support Arkouda’s objectives?\nHow have the design/goals/scope of the project changed since you started working on it?\n\n\nHow has the evolution of hardware capabilities impacted the set of processing algorithms that are viable for addressing considerations of scale?\n\nWhat are the relative factors of scale along space/time axes that you are optimizing for?\nWhat are some opportunities that are still unrealized for algorithmic optimizations to expand horizons for large-scale data manipulation?\n\n\nFor teams/individuals who are working with Arkouda can you describe the implementation process and what the end-user workflow looks like?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Arkouda used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Arkouda?\nWhen is Arkouda the wrong choice?\nWhat do you have planned for the future of Arkouda?\n\nContact Info\n\nWebsite\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nArkouda\nNJIT == New Jersey Institute of Technology\nNumPy\nPandas\n\nPodcast.__init__ Episode\n\n\nNetworkX\nChapel\nMassive Graph Analytics Book\nRay\n\nPodcast.__init__ Episode\n\n\nDask\n\nPodcast Episode\n\n\nBodo\n\nPodcast Episode\n\n\nStinger Graph Analytics\nBears-R-Us\n0MQ\nTriangle Centrality\nDegree Centrality\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with David Bader about the Arkouda framework for exploratory data analysis at interactive speeds across massive data sets and how it supports operating from a single laptop to multiple servers in the cloud or thousands of cores on a supercomputer","date_published":"2022-07-31T16:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d027e1eb-bf64-4797-9e45-d3aabd3f4908.mp3","mime_type":"audio/mpeg","size_in_bytes":35936138,"duration_in_seconds":2437}]},{"id":"podlove-2022-07-31t20:32:04+00:00-c09fc81a33d67b9","title":"What \"Data Lineage Done Right\" Looks Like And How They're Doing It At Manta","url":"https://www.dataengineeringpodcast.com/manta-data-lineage-as-a-service-episode-312","content_text":"Summary\nData lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Because of its centrality to your data systems it is valuable for debugging, governance, understanding context, and myriad other purposes. This means that it is important to have an accurate and complete lineage graph so that you don’t have to perform your own detective work when time is in short supply. In this episode Ernie Ostic shares the approach that he and his team at Manta are taking to build a complete view of data lineage across the various data systems in your organization and the useful applications of that information in the work of every data stakeholder.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. 
With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.\nYour host is Tobias Macey and today I’m interviewing Ernie Ostic about Manta, an automated data lineage service for managing visibility and quality of your data workflows\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Manta is and the story behind it?\nWhat are the core problems that Manta aims to solve?\nData lineage and metadata systems are a hot topic right now. What is your summary of the state of the market?\n\nWhat are the capabilities that would lead a team or organization to choose Manta in place of the other options?\n\n\nWhat are some examples of \"data lineage done wrong\"? (what does that look like?)\n\nWhat are the risks associated with investing in an incomplete solution for data lineage?\nWhat are the core attributes that need to be tracked consistently to enable a comprehensive view of lineage?\n\n\nHow do the practices for collecting lineage and metadata differ between structured, semi-structured, and unstructured data assets and their movement?\nCan you describe how Manta is implemented?\n\nHow have the design and goals of the product changed or evolved?\n\n\nWhat is involved in integrating Manta with an organization’s data systems?\n\nWhat are the biggest sources of friction/errors in collecting and cleaning lineage information?\n\n\nOne of the interesting capabilities that you advertise is versioning and time travel for lineage information. Why is that a necessary and useful feature?\nOnce an organization’s lineage information is available in Manta, how does it factor into the daily workflow of different roles/stakeholders?\nThere are a variety of use cases for metadata in a data platform beyond lineage. What are the benefits that you see from focusing on that as a core competency?\nBeyond validating quality, identifying errors, etc. it seems that automated discovery of lineage could produce insights into when the presence of data assets that shouldn’t exist. 
What are some examples of similar discoveries that you are aware of?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Manta used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Manta?\nWhen is Manta the wrong choice?\nWhat do you have planned for the future of Manta?\n\nContact Info\n\nLinkedIn\n@dsrealtime01 on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nManta\nEgeria\nOpenLineage\n\nPodcast Episode\n\n\nApache Atlas\nNeo4J\nEasytrieve\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Because of its centrality to your data systems it is valuable for debugging, governance, understanding context, and myriad other purposes. This means that it is important to have an accurate and complete lineage graph so that you don’t have to perform your own detective work when time is in short supply. In this episode Ernie Ostic shares the approach that he and his team at Manta are taking to build a complete view of data lineage across the various data systems in your organization and the useful applications of that information in the work of every data stakeholder.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Ernie Ostic about the Manta platform and how it approaches the collection and processing of metadata to build a comprehensive view of data lineage across your various data systems","date_published":"2022-07-31T16:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2f443322-9fdd-498e-901d-2520fa62809a.mp3","mime_type":"audio/mpeg","size_in_bytes":43325181,"duration_in_seconds":3918}]},{"id":"podlove-2022-07-24t23:20:31+00:00-d579f2bd3b764d4","title":"Writing The Book That Offers A Single Reference For The Fundamentals Of Data Engineering","url":"https://www.dataengineeringpodcast.com/fundamentals-of-data-engineering-episode-310","content_text":"Summary\nData engineering is a difficult job, requiring a large number of skills that often don’t overlap. Any effort to understand how to start a career in the role has required stitching together information from a multitude of resources that might not all agree with each other. In order to provide a single reference for anyone tasked with data engineering responsibilities Joe Reis and Matt Housley took it upon themselves to write the book \"Fundamentals of Data Engineering\". In this episode they share their experiences researching and distilling the lessons that will be useful to data engineers now and into the future, without being tied to any specific technologies that may fade from fashion.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. 
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect today.\nYour host is Tobias Macey and today I’m interviewing Joe Reis and Matt Housley about their new book on the Fundamentals of Data Engineering\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you explain what possessed you to write such an ambitious book?\nWhat are your goals with this book?\nWhat was your process for determining what subject areas to include in the book?\n\nHow did you determine what level of granularity/detail to use for each subject area?\n\n\nClosely linked to what subjects are necessary to be effective as a data engineer is the concept of what that title encompasses. How have the definitions shifted over the past few decades?\n\nIn your experiences working in industry and researching for the book, what is the prevailing view on what data engineers do?\nIn the book you focus on what you term the \"data lifecycle engineer\". What are the skills and background that are needed to be successful in that role?\n\n\nAny discussion of technological concepts and how to build systems tends to drift toward specific tools. How did you balance the need to be agnostic to specific technologies while providing relevant and relatable examples?\nWhat are the aspects of the book that you anticipate needing to revisit over the next 2 – 5 years?\n\nWhich elements do you think will remain evergreen?\n\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on writing \"Fundamentals of Data Engineering\"?\nWhat are your predictions for the future of data engineering?\n\nContact Info\n\nJoe\n\nLinkedIn\nWebsite\n\n\nMatt\n\nLinkedIn\n@doctorhousley on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nFundamentals of Data Engineering (affiliate link)\nTernary Data\nDesigning Data Intensive Applications\nJames Webb Space Telescope\nGoogle Colossus Storage System\nDMBoK == Data Management Body of Knowledge\nDAMA\nBill Inmon\nApache Druid\nRTFM == Read The Fine Manual\nDuckDB\n\nPodcast Episode\n\n\nVisiCalc\nTernary Data Newsletter\nMeroxa\n\nPodcast Episode\n\n\nRuby on Rails\nLambda Architecture\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data engineering is a difficult job, requiring a large number of skills that often don’t overlap. Any effort to understand how to start a career in the role has required stitching together information from a multitude of resources that might not all agree with each other. In order to provide a single reference for anyone tasked with data engineering responsibilities Joe Reis and Matt Housley took it upon themselves to write the book "Fundamentals of Data Engineering". In this episode they share their experiences researching and distilling the lessons that will be useful to data engineers now and into the future, without being tied to any specific technologies that may fade from fashion.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Joe Reis and Matt Housley about their experience and insights gained while writing the book "Fundamentals of Data Engineering" and the inherent challenges of offering a single reference that covers the variety of skills necessary to work as a data engineer.","date_published":"2022-07-24T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/69fa4c61-5791-4147-90aa-f28244b97872.mp3","mime_type":"audio/mpeg","size_in_bytes":53750161,"duration_in_seconds":3662}]},{"id":"podlove-2022-07-24t23:07:50+00:00-a43454ca589de38","title":"Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster","url":"https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309","content_text":"Summary\nThe current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its \"software defined assets\" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. 
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Nick Schrock about software defined assets and improving the developer experience for data orchestration with Dagster\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the notable updates in Dagster since the last time we spoke? (November, 2021)\nOne of the core concepts that you introduced and then stabilized in recent releases is the \"software defined asset\" (SDA). How have your users reacted to this capability?\n\nWhat are the notable outcomes in development and product practices that you have seen as a result?\n\n\nWhat are the changes to the interfaces and internals of Dagster that were necessary to support SDA?\nHow did the API design shift from the initial implementation once the community started providing feedback?\nYou’re releasing the stable 1.0 version of Dagster as part of something called \"Dagster Day\" on August 9th. What do you have planned for that event and what does the release mean for users who have been refraining from using the framework until now?\nAlong with your 1.0 commitment to a stable interface in the framework you are also opening your cloud platform for general availability. What are the major lessons that you and your team learned in the beta period?\n\nWhat new capabilities are coming with the GA release?\n\n\nA core thesis in your work on Dagster is that developer tooling for data professionals has been lacking. What are your thoughts on the overall progress that has been made as an industry?\n\nWhat are the sharp edges that still need to be addressed?\n\n\nA core facet of product-focused software development over the past decade+ is CI/CD and the use of pre-production environments for testing changes, which is still a challenging aspect of data-focused engineering. How are you thinking about those capabilities for orchestration workflows in the Dagster context?\n\nWhat are the missing pieces in the broader ecosystem that make this a challenge even with support from tools and frameworks?\nHow has the situation improved in the recent past and looking toward the near future?\nWhat role does the SDA approach have in pushing on these capabilities?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Dagster used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on bringing Dagster to 1.0 and cloud to GA?\nWhen is Dagster/Dagster Cloud the wrong choice?\nWhat do you have planned for the future of Dagster and Elementl?\n\nContact Info\n\n@schrockn on Twitter\nschrockn on GitHub\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nDagster Day\nDagster\n\n1st Podcast Episode\n2nd Podcast Episode\n\n\nElementl\nGraphQL\nUnbundling Airflow\nFeast\nSpark SQL\nDagster Cloud Branch Deployments\nDagster custom I/O manager\nLakeFS\nIceberg\nProject Nessie\nPrefect\n\nPrefect Orion\n\n\nAstronomer\nTemporal\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Nick Schrock about the role of the data orchestration engine in making sense of the modern data stack and how Dagster's support for software defined assets simplifies the work of building and understanding the flow of data in your platform.","date_published":"2022-07-24T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cfa70eb6-b36b-44a0-96f2-a71b65a901bd.mp3","mime_type":"audio/mpeg","size_in_bytes":45856584,"duration_in_seconds":3494}]},{"id":"podlove-2022-07-17t23:24:41+00:00-51663ba9b07b1c7","title":"Making The Total Cost Of Ownership For External Data Manageable With Crux","url":"https://www.dataengineeringpodcast.com/crux-managed-external-data-integration-episode-308","content_text":"Summary\nThere are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third party data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. 
Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nTired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!\nYour host is Tobias Macey and today I’m interviewing Mark Etherington about Crux, a platform that helps organizations scale their most critical data delivery, operations, and transformation needs\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Crux is and the story behind it?\nWhat are the categories of information that organizations use external data sources for?\nWhat are the challenges and long-term costs related to integrating external data sources that are most often overlooked or underestimated?\n\nWhat are some of the primary risks involved in working with external data sources?\n\n\nHow do you work with customers to help them understand the long-term costs associated with integrating various sources?\n\nHow does that play into the broader conversation about assessing the value of a given data-set?\n\n\nCan you describe how you have architected the Crux platform?\n\nHow have the design and goals of the platform changed or evolved since you started working on it?\nWhat are the design choices that have had the most significant impact on your ability to reduce operational complexity and maintenance overhead for the data you are working with?\n\n\nFor teams who are relying on Crux to manage external data, what is involved in setting up the initial integration with your system?\n\nWhat are the steps to on-board new data sources?\n\n\nHow do you manage data quality/data observability across your different data providers?\n\nWhat kinds of signals do you propagate to your customers to feed into their operational platforms?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Crux used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Crux?\nWhen is Crux the wrong choice?\nWhat do you have planned for the future of Crux?\n\nContact Info\n\nEmail\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. 
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nCrux\nThomson Reuters\nGoldman Sachs\nJP Morgan\nAvro\nESG == Environmental, Social, Governance Data\nSelenium\nGoogle Cloud Platform\nCadence\nAirflow\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Shipyard: ![Shipyard](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/v99MkWSB.png)\r\n\r\nShipyard is an orchestration platform that helps data teams build out solid data operations from the get-go by connecting data tools and streamlining data workflows. Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build workflows while enabling engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.\r\n\r\nObservability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams. With a high level of concurrency, scalability, and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them. Go to <u>[dataengineeringpodcast.com/shipyard](https://www.dataengineeringpodcast.com/shipyard)</u> to get started automating powerful workflows with their free developer plan today!","content_html":"
","summary":"An interview with Mark Etherington, CTO of Crux, about the cost and complexity involved in external data integration and how their platform is engineered to make it manageable for organizations of all sizes","date_published":"2022-07-17T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ddf6f5f2-f205-4dfe-96ba-88af9578fd1c.mp3","mime_type":"audio/mpeg","size_in_bytes":53415154,"duration_in_seconds":4032}]},{"id":"podlove-2022-07-17t23:05:52+00:00-4dfafe02971e328","title":"Joe Reis Flips The Script And Interviews Tobias Macey About The Data Engineering Podcast","url":"https://www.dataengineeringpodcast.com/joe-reis-flips-the-script-episode-307","content_text":"Summary\nData engineering is a large and growing subject, with new technologies, specializations, and \"best practices\" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of \"Fundamentals of Data Engineering\", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. 
Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today we’re flipping the script. Joe Reis of Ternary Data will be interviewing me about my time as the host of this show and my perspectives on the data ecosystem\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nNow I’ll hand it off to Joe…\n\nJoe’s Notes\n\nYou do a lot of podcasts. Why? Podcast.init started in 2015, and your first episode of Data Engineering was published January 14, 2017. Walk us through the start of these podcasts.\nwhy not a data science podcast? why DE?\nYou’ve published 306 of shows of the Data Engineering Podcast, plus 370 for the init podcast, then you’ve got a new ML podcast. How have you kept the motivation over the years?\nWhat’s the process for the show (finding guests, topics, etc….recording, publishing)? It’s a lot of work. Walk us through this process.\nYou’ve done a ton of shows and have a lot of context with what’s going on in the field of both data engineering and Python. What have been some of the major evolutions of topics you’ve covered?\nWhat’s been the most counterintuitive show or interesting thing you’ve learned while producing the show?\nHow do you keep current with the data engineering landscape?\nYou’ve got a very unique perspective of data engineering, having interviewed countless top people in the field. What are the the big trends you see in data engineering over the next 3 years?\nWhat do you do besides podcasting? Is this your only gig, or do you do other work?\nwhats next?\n\nContact Info\n\nLinkedIn\nWebsite\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPodcast.__init__\nThe Machine Learning Podcast\nTernary Data\nFundamentals of Data Engineering book (affiliate link)\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"Joe Reis takes over the show and interviews Tobias Macey, host of the Data Engineering Podcast, about his own show and the other projects that keep him busy","date_published":"2022-07-17T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1cd7f699-08d7-44f3-b231-f21e4c74ba12.mp3","mime_type":"audio/mpeg","size_in_bytes":46240662,"duration_in_seconds":3399}]},{"id":"podlove-2022-07-10t20:53:42+00:00-99b9e2219acfa11","title":"Charting the Path of Riskified's Data Platform Journey","url":"https://www.dataengineeringpodcast.com/riskified-data-platform-journey-episode-306","content_text":"Summary\nBuilding a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nTired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!\nYour host is Tobias Macey and today I’m interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat does Riskified do?\nCan you describe the role of data at Riskified?\n\nWhat are some of the core types and sources of information that you are dealing with?\nWho/what are the primary consumers of the data that you are responsible for?\n\n\nWhat are the team structures that you have tested for your data professionals?\n\nWhat is the composition of your data roles? (e.g. ML engineers, data engineers, data scientists, data product managers, etc.)\n\n\nWhat are the organizational constraints that have the biggest impact on the design and usage of your data systems?\nCan you describe the current architecture of your data platform?\n\nWhat are some of the most notable evolutions/redesigns that you have gone through?\n\n\nWhat is your process for establishing and evaluating selection criteria for any new technologies that you adopt?\n\nHow do you facilitate knowledge sharing between data professionals?\n\n\nWhat have you found to be the most challenging technological and organizational complexities that you have had to address on the path to your current state?\nWhat are the methods that you use for staying up to date with the data ecosystem? (opportunity to discuss Haya Data conference)\nIn your role as organizers of the Haya Data conference, what are some of the insights that you have gained into the present state and future trajectory of the data community?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the Riskified data platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the data platform for Riskified?\nWhat do you have planned for the future of your data platform?\n\nContact Info\n\nInbar\n\nLinkedIn\n\n\nLior\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nRiskified\nADABAS\nAerospike\n\nPodcast Episode\n\n\nNeo4J\nKafka\nDelta Lake\n\nPodcast Episode\n\n\nDatabricks\nSnowflake\n\nPodcast Episode\n\n\nTableau\nLooker\n\nPodcast Episode\n\n\nRedshift\nEvent Sourcing\nAvro\nhayaData Conference\nData Mesh\nData Catalog\nData Governance\nMLOps\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Inbar Yogev and Lior Winner about their work on the data platform for Riskified and how they approach the technical and organizational strategies that are necessary for success","date_published":"2022-07-10T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fc2b3677-08a2-48ae-9933-31a83d7eb2a9.mp3","mime_type":"audio/mpeg","size_in_bytes":31901534,"duration_in_seconds":2397}]},{"id":"podlove-2022-07-10t20:31:09+00:00-7202a5bfdd8ee64","title":"Maintain Your Data Engineers' Sanity By Embracing Automation","url":"https://www.dataengineeringpodcast.com/automation-for-data-engineers-with-chris-riccomini-episode-305","content_text":"Summary\nBuilding and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. 
Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?\nWhat are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?\nThe application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent?\nWhat are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate?\nAs someone with a software engineering background and extensive experience working in data, what are the missing links to make those teams/objectives work together more seamlessly?\n\nHow can tooling and automation help in that endeavor?\n\n\nA key factor in the adoption of automation for application delivery is automated tests. What are some of the strategies you find useful for identifying scope and targets for testing/monitoring of data products?\nAs data usage and capabilities grow and evolve in an organization, what are the junction points that are in greatest need of well-defined data contracts?\n\nHow can automation aid in enforcing and alerting on those contracts in a continuous fashion?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen automation of data operations used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on automation for data systems?\nWhen is automation the wrong choice?\nWhat does the future of data engineering look like?\n\nContact Info\n\nWebsite\n@criccomini on Twitter\ncriccomini on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nWePay\nEnterprise Service Bus\nThe Missing README\nHadoop\nConfluent Schema Registry\n\nPodcast Episode\n\n\nAvro\nCDC == Change Data Capture\nDebezium\n\nPodcast Episode\n\n\nData Mesh\nWhat the heck is a data mesh? 
blog post\nSRE == Site Reliability Engineer\nTerraform\nChef configuration management tool\nPuppet configuration management tool\nAnsible configuration management tool\nBigQuery\nAirflow\nPulumi\n\nPodcast.__init__ Episode\n\n\nMonte Carlo\n\nPodcast Episode\n\n\nBigeye\n\nPodcast Episode\n\n\nAnomalo\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\nSchemata\nData Engineering Weekly newsletter\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Chris Riccomini about the benefits and challenges of adopting automation for data engineering workflows and how to incorporate data contracts between teams and systems","date_published":"2022-07-10T16:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5f0af8ad-555d-4eba-84d2-1c86e0b781b7.mp3","mime_type":"audio/mpeg","size_in_bytes":54033744,"duration_in_seconds":3908}]},{"id":"podlove-2022-07-03t20:38:50+00:00-480afab5e5b0b95","title":"Be Confident In Your Data Integration By Quickly Validating Matching Records With data-diff","url":"https://www.dataengineeringpodcast.com/data-diff-open-source-data-integration-validation-episode-303","content_text":"Summary\nThe perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRandom data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. 
With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nYour host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy and Simon Eskildsen about their work to open source the data diff utility that they have been building at Datafold\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the data diff tool is and the story behind it?\n\nWhat was your motivation for going through the process of releasing your data diff functionality as an open source utility?\n\n\nWhat are some of the ways that data-diff composes with other data quality tools? (e.g. Great Expectations, Soda SQL, etc.)\nCan you describe how data-diff is implemented?\n\nGiven the target of having a performant and scalable utility how did you approach the question of language selection?\n\n\nWhat are some of the ways that you have seen data-diff incorporated in the workflow of data teams?\nWhat were the steps that you needed to do to get the project cleaned up and separated from your internal implementation for release as open source?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data-diff used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data-diff?\nWhen is data-diff the wrong choice?\nWhat do you have planned for the future of data-diff?\n\nContact Info\n\nGleb\n\nLinkedIn\n@glebmm on Twitter\n\n\nSimon\n\nWebsite\n@Sirupsen on Twitter\nsirupsen on GitHub\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nDatafold\n\nPodcast Episode\n\n\ndata-diff\nAutodesk\nAirbyte\n\nPodcast Episode\n\n\nDebezium\n\nPodcast Episode\n\n\nNapkin Math newsletter\nAirflow\nDagster\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nTrino\nPreql\n\nPodcast.__init__ Episode\n\n\nErez Shinan\nFivetran\n\nPodcast Episode\n\n\nmd5\nCRC32\nMerkle Tree\nLocally Optimistic\nPresto\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSpecial Guest: Gleb Mezhanskiy.","content_html":"

","summary":"An interview with Gleb Mezhanskiy and Simon Eskildsen about how the open source data-diff utility can quickly and reliably validate data between your source and destination systems so that you can be confident that everything is working as intended.","date_published":"2022-07-03T16:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d0042300-5a77-4f94-aad4-dfe5ba6d24a3.mp3","mime_type":"audio/mpeg","size_in_bytes":65484471,"duration_in_seconds":4257}]},{"id":"podlove-2022-07-03t20:47:22+00:00-f4a555a7a7583d8","title":"The View From The Lakehouse Of Architectural Patterns For Your Data Platform","url":"https://www.dataengineeringpodcast.com/starburst-lakehouse-modern-data-architecture-episode-304","content_text":"Summary\nThe ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. 
No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nTired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!\nYour host is Tobias Macey and today I’m interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\n\nIn your opinion as an astrophysicist, how well does the metaphor of a starburst map to your current work at the company of the same name?\n\n\nCan you describe what you see as the dominant factors that influence a team’s approach to data architecture and design?\nTwo of the most repeated (often mis-attributed) terms in the data ecosystem for the past couple of years are the \"modern data stack\" and the \"data mesh\". As someone who is working at a company that can be construed to provide solutions for either/both of those patterns, what are your thoughts on their lasting strength and long-term viability?\nWhat do you see as the strengths of the emerging lakehouse architecture in the context of the \"modern data stack\"?\n\nWhat are the factors that have prevented it from being a default choice compared to cloud data warehouses? (e.g. BigQuery, Redshift, Snowflake, Firebolt, etc.)\nWhat are the recent developments that are contributing to its current growth?\nWhat are the weak points/sharp edges that still need to be addressed? (both internal to the platforms and in the external ecosystem/integrations)\n\n\nWhat are some of the implementation challenges that teams often experience when trying to adopt a lakehouse strategy as the core building block of their data systems?\n\nWhat are some of the exercises that they should be performing to help determine their technical and organizational capacity to support that strategy over the long term?\n\n\nOne of the core requirements for a data mesh implementation is to have a common system that allows for product teams to easily build their solutions on top of. 
How do lakehouse/data virtualization systems allow for that?\n\nWhat are some of the lessons that need to be shared with engineers to help them make effective use of these technologies when building their own data products?\nWhat are some of the supporting services that are helpful in these undertakings?\n\n\nWhat do you see as the forces that will have the most influence on the trajectory of data architectures over the next 2 – 5 years?\nWhat are the most interesting, innovative, or unexpected ways that you have seen lakehouse architectures used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the Starburst product?\nWhen is a lakehouse the wrong choice?\nWhat do you have planned for the future of Starburst’s technology platform?\n\nContact Info\n\nLinkedIn\n@ctartow on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nStarburst\nTrino\nTeradata\nCognos\nData Lakehouse\nData Virtualization\nIceberg\n\nPodcast Episode\n\n\nHudi\n\nPodcast Episode\n\n\nDelta\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nAWS Lake Formation\nClickhouse\n\nPodcast Episode\n\n\nDruid\nPinot\n\nPodcast Episode\n\n\nStarburst Galaxy\nVarada\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Colleen Tartow about the forces that are shaping the current generation of data architectures and how the underlying technologies are evolving to support patterns including data mesh and the modern data stack","date_published":"2022-07-03T16:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1f6e1024-00b6-4b7a-a85f-d61fe0ed7b94.mp3","mime_type":"audio/mpeg","size_in_bytes":48649025,"duration_in_seconds":3523}]},{"id":"podlove-2022-06-27t01:29:46+00:00-b09ab72ec4ebeab","title":"Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform","url":"https://www.dataengineeringpodcast.com/unfolded-geospatial-analytics-platform-episode-302","content_text":"Summary\nThe proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. 
No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nUnstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.\nYour host is Tobias Macey and today I’m interviewing Isaac Brodsky about Foursquare’s Unfolded platform for working with spatial data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the Unfolded platform is and the story behind it?\nWhat are some of the core challenges of working with spatial data?\n\nWhat are some of the sources that organizations rely on for collecting or generating those data sets?\n\n\nWhat are the capabilities that the Unfolded platform offers for spatial analytics?\n\nWhat use cases are you primarily focused on supporting?\nWhat (if any) are the datasets or analyses that you are consciously not investing in supporting?\n\n\nCan you describe how the Unfolded platform is implemented?\n\nHow have the design and goals shifted or evolved since you started working on Unfolded?\nWhat are the new constraints or opportunities that are available after the merger with Foursquare?\n\n\nCan you describe a typical workflow for someone using Unfolded to manage their spatial information and build an analysis on top of it?\n\nWhat are some of the data modeling considerations that are necessary when populating a custom data set with Unfolded?\n\n\nWhat are some of the techniques that you needed to build to allow for loading large data sets into a users’s browser while maintaining sufficient performance?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Unfolded used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Unfolded?\nWhen is Unfolded the wrong choice?\nWhat do you have planned for the future of Unfolded?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nUnfolded Platform\nH3 Hexagonal Map Tiles Library\nCarto\nMapbox\nOpen Street Map\nRaster Files\nHex Tiles\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Unstruk: ![Unstruck Data](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/J3_WeYmj.png)\r\n\r\nUnstruk Data offers an API-driven solution to simplify the process of transforming unstructured data files into actionable intelligence about real-world assets without writing a line of code – putting insights generated from this data at enterprise teams’ fingertips. The company was founded in 2021 by Kirk Marple after his tenure as CTO of Kespry. Kirk possesses extensive industry knowledge including over 25 years of experience building and architecting scalable SaaS platforms and applications, prior successful startup exits, and deep unstructured and perception data experience. Unstruk investors include 8VC, Preface Ventures, Valia Ventures, Shell Ventures and Stage Venture Partners.\r\n\r\nGo to <u>[dataengineeringpodcast.com/unstruk](https://www.dataengineeringpodcast.com/unstruk)</u> today to transform your messy collection of unstructured data files into actionable assets that power your business!","content_html":"

","summary":"An interview with Isaac Brodsky about the Unfolded platform for geospatial analytics and how you can use it to gain insight into seemingly disparate data sets that share spatial relationships","date_published":"2022-06-26T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/01b04731-c949-4682-ab5e-dcdbf8eeb64c.mp3","mime_type":"audio/mpeg","size_in_bytes":54890036,"duration_in_seconds":4021}]},{"id":"podlove-2022-06-26t12:40:18+00:00-5619b493cfde665","title":"Strategies And Tactics For A Successful Master Data Management Implementation","url":"https://www.dataengineeringpodcast.com/profisee-master-data-management-episode-301","content_text":"Summary\nThe most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRandom data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. 
Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Malcolm Hawker about master data management strategies for the enterprise\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving your definition of what MDM is and the scope of activities/functions that it includes?\n\nHow have evolutions in the data landscape shifted the conversation around MDM?\n\n\nCan you describe what Profisee is and the story behind it?\n\nWhat was your path to joining Profisee and what is your role in the business?\n\n\nWho are the target customers for Profisee?\n\nWhat are the challenges that they typically experience that leads them to MDM as a solution for their problems?\n\n\nHow does the narrative around data observability/data quality from tools such as Great Expectations, Monte Carlo, etc. differ from the data quality benefits of a MDM strategy?\nHow do recent conversations around semantic/metrics layers compare to the way that MDM approaches the problem of domain modeling?\nWhat are the steps to defining an MDM strategy for an organization or business unit?\n\nOnce there is a strategy, what are the tactical elements of the implementation?\nWhat is the role of the toolchain in that implementation? (e.g. 
Spark, dbt, Airflow, etc.)\n\n\nCan you describe how Profisee is implemented?\n\nHow does the customer base inform the architectural approach that Profisee has taken?\n\n\nCan you describe the adoption process for an organization that is using Profisee for their MDM?\nOnce an organization has defined and adopted an MDM strategy, what are the ongoing maintenance tasks related to the domain models?\nWhat are the most interesting, innovative, or unexpected ways that you have seen MDM used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working in MDM?\nWhen is Profisee the wrong choice?\nWhat do you have planned for the future of Profisee?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nProfisee\nMDM == Master Data Management\nCRM == Customer Relationship Management\nERP == Enterprise Resource Planning\nLevenshtein Distance Algorithm\nSoundex\nCDP == Customer Data Platform\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Malcolm Hawker about the technical and organizational strategies that are required for a successful implementation of Master Data Management, as well as when and why it is necessary.","date_published":"2022-06-26T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f4161137-4ba5-4db0-ad93-7acdb9e54896.mp3","mime_type":"audio/mpeg","size_in_bytes":60097880,"duration_in_seconds":4148}]},{"id":"podlove-2022-06-19t23:53:23+00:00-c4df04fc1e0f37c","title":"Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas","url":"https://www.dataengineeringpodcast.com/canvas-app-spreadsheets-modern-data-stack-episode-300","content_text":"Summary\nData analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry Ryan Buick created the Canvas application with a spreadsheet oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nUnstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.\nYour host is Tobias Macey and today I’m interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Canvas is and the story behind it?\nThe \"modern data stack\" has enabled organizations to analyze unparalleled volumes of data. What are the shortcomings in the operating model that keeps business users dependent on engineers to answer their questions?\nWhy is the spreadsheet such a popular and persistent metaphor for working with data?\n\nWhat are the biggest issues that existing spreadsheet software run up against as they scale both technically and organizationally?\n\n\nWhat are the new metaphors/design elements that you needed to develop to extend the existing capabilities and use cases of spreadsheets while keeping them familiar?\nCan you describe how the Canvas platform is implemented?\n\nHow have the design and goals of the product changed/evolved since you started working on it?\n\n\nWhat is the workflow for a business user that is using Canvas to iterate on a series of questions?\nWhat are the collaborative features that you have built into Canvas and who are they for? (e.g. other business users, data engineers <-> business users, etc.)\nWhat are the situations where the spreadsheet abstraction starts to break down?\n\nWhat are the extension points/escape hatches that you have built into the product for when that happens?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Canvas used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Canvas?\nWhen is Canvas the wrong choice?\nWhat do you have planned for the future of Canvas?\n\nContact Info\n\nLinkedIn\n@ryanjbuick on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nCanvas\nFlexport\n\nPodcast Episode about their data mesh implementation\n\n\nExcel\nLightdash\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nFigma\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Ryan Buick about the Canvas platform and how it combines the approachability of spreadsheets with the power of modern data systems to reduce the barrier to analysis for everyone.","date_published":"2022-06-19T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/277267a3-323b-4af4-9cf9-f37efbb1bf51.mp3","mime_type":"audio/mpeg","size_in_bytes":35820134,"duration_in_seconds":2578}]},{"id":"podlove-2022-06-19t23:45:20+00:00-a39341307058dfc","title":"Level Up Your Data Platform With Active Metadata","url":"https://www.dataengineeringpodcast.com/atlan-active-metadata-episode-299","content_text":"Summary\nMetadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. 
Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nYour host is Tobias Macey and today I’m interviewing Prukalpa Sankar about how data platforms can benefit from the idea of \"active metadata\" and the work that she and her team at Atlan are doing to make it a reality\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what \"active metadata\" is and how it differs from the current approaches to metadata systems?\nWhat are some of the use cases that \"active metadata\" can enable for data producers and consumers?\n\nWhat are the points of friction that those users encounter in the current formulation of metadata systems?\n\n\nCentral metadata systems/data catalogs came about as a solution to the challenge of integrating every data tool with every other data tool, giving a single place to integrate. What are the lessons that are being learned from the \"modern data stack\" that can be applied to centralized metadata?\nCan you describe the approach that you are taking at Atlan to enable the adoption of \"active metadata\"?\n\nWhat are the architectural capabilities that you had to build to power the outbound traffic flows?\n\n\nHow are you addressing the N x M integration problem for pushing metadata into the necessary contexts at Atlan?\n\nWhat are the interfaces that are necessary for receiving systems to be able to make use of the metadata that is being delivered?\nHow does the type/category of metadata impact the type of integration that is necessary?\n\n\nWhat are some of the automation possibilities that metadata activation offers for data teams?\n\nWhat are the cases where you still need a human in the loop?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen active metadata capabilities used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on activating metadata for your users?\nWhen is an active approach to metadata the wrong choice?\nWhat do you have planned for the future of Atlan and active metadata?\n\nContact Info\n\nLinkedIn\n@prukalpa on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. 
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nAtlan\nWhat is Active Metadata?\nSegment\n\nPodcast Episode\n\n\nZapier\nArgoCD\nKubernetes\nWix\nAWS Lambda\nModern Data Culture Blog Post\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"A conversation with Atlan co-founder Prukalpa Sankar about the idea of active metadata and how it can reduce the toil involved in managing a data platform","date_published":"2022-06-19T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/41f31ba7-25dc-42f5-ab4e-8f1e64fc2b20.mp3","mime_type":"audio/mpeg","size_in_bytes":44240261,"duration_in_seconds":3155}]},{"id":"podlove-2022-06-13t02:18:29+00:00-0324c3bd1b9b2c5","title":"Discover And De-Clutter Your Unstructured Data With Aparavi","url":"https://www.dataengineeringpodcast.com/aparavi-unstructured-data-management-episode-297","content_text":"Summary\nUnstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. 
In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Aparavi is and the story behind it?\nWho are the target customers for Aparavi and how does that inform your product roadmap and messaging?\nWhat are some of the insights that you are able to provide about an organization’s data?\n\nOnce you have generated those insights, what are some of the actions that they typically catalyze?\n\n\nWhat are the types of storage and data systems that you integrate with?\nCan you describe how the Aparavi platform is implemented?\n\nHow do the trends in cloud storage and data systems influence the ways that you evolve the system?\n\n\nCan you describe a typical workflow for an organization using Aparavi?\nWhat are the mechanisms that you use for categorizing data assets?\n\nWhat are the interfaces that you provide for data owners and operators to provide heuristics to customize classification/cataloging of data?\n\n\nHow can teams integrate with Aparavi to expose its insights to other tools for uses such as automation or data catalogs?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Aparavi used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Aparavi?\nWhen is Aparavi the wrong choice?\nWhat do you have planned for the future of Aparavi?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nAparavi\nSHA-512\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Rod Christensen about how the Aparavi platform automates the discovery and management of unstructured data in your organization to cut down on cost and clutter","date_published":"2022-06-12T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1a069a78-9fb1-4962-9f7c-620a5986722b.mp3","mime_type":"audio/mpeg","size_in_bytes":31767343,"duration_in_seconds":2952}]},{"id":"podlove-2022-06-13t02:26:37+00:00-d3da68fe470bb74","title":"Hire And Scale Your Data Team With Intention","url":"https://www.dataengineeringpodcast.com/trupti-natu-data-team-growth-episode-298","content_text":"Summary\nBuilding a well rounded and effective data team is an iterative process, and the first hire can set the stage for future success or failure. Trupti Natu has been the first data hire multiple times and gone through the process of building teams across the different stages of growth. In this episode she shares her thoughts and insights on how to be intentional about establishing your own data team.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. 
Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nUnstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.\nYour host is Tobias Macey and today I’m interviewing Trupti Natu about strategies for building your team, from the first data hire to post-acquisition\n\nInterview\n\nIntroduction\nHow did you get involved in the area of FinTech & Data Science (management)?\nHow would you describe your overall career trajectory in data?\nCan you describe what your experience has been as a data professional at different stages of company growth?\nWhat are the traits that you look for in a first or second data hire at an organization?\n\nWhat are useful metrics for success to help gauge the effectiveness of hires at this early stage of data capabilities?\n\n\nWhat are the broad goals and projects that early data hires should be focused on?\n\nWhat are the indicators that you look for to determine when to scale the team?\n\n\nAs you are building a team of data professionals, what are the organizational topologies that you have found most effective? (e.g. centralized vs. embedded data pros, etc.)\nWhat are the recruiting and screening/interviewing techniques that you have found most helpful given the relative scarcity of experienced data practitioners?\nWhat are the organizational and technical structures that are helpful to establish early in the organization’s data journey to reduce the onboarding time for new hires?\nYour background has primarily been in FinTech. How does the business domain influence the types of background and domain expertise that you look for?\nYou recently went through an acquisition at the startup you were with. 
Can you describe the data-related projects that were required during the merger?\n\nWhat are the impedance mismatches that you have had to resolve in your data systems, moving from a fast-moving startup into a larger, more established organization?\nBeing a FinTech company, what are some of the categories of regulatory considerations that you had to deal with during the integration process?\n\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned along your career journey?\nWhat are some of the pieces of advice that you wished you knew at the beginning of your career, and that you would like to share with others in that situation?\n\nContact Info\n\nLinkedIn\n@truptinatu on Twitter\nTrupti is hiring for multiple product data science roles. Feel free to DM her on Twitter or LinkedIn to find out more\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nSumoLogic\nFinTech\nPII == Personally Identifiable Information\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"A conversation about the challenges involved in hiring your first data professional and scaling up to a full data team while consistently delivering value and retaining talented engineers.","date_published":"2022-06-12T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3ca26041-20f3-4fe7-ba8f-7f1232601de4.mp3","mime_type":"audio/mpeg","size_in_bytes":50515461,"duration_in_seconds":3653}]},{"id":"podlove-2022-06-06t02:00:07+00:00-99715afa1e23e61","title":"Simplify Data Security For Sensitive Information With The Skyflow Data Privacy Vault","url":"https://www.dataengineeringpodcast.com/skyflow-data-privacy-vault-episode-296","content_text":"Summary\nThe best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems. In this episode Sean Falconer explains the idea of a data privacy vault and how this new architectural element can drastically reduce the potential for making a mistake with how you manage regulated or personally identifiable information.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Sean Falconer about the idea of a data privacy vault and how the Skyflow team are working to make it turn-key\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Skyflow is and the story behind it?\nWhat is a \"data privacy vault\" and how does it differ from strategies such as privacy engineering or existing data governance patterns?\nWhat are the primary use cases and capabilities that you are focused on solving for with Skyflow?\n\nWho is the target customer for Skyflow (e.g. how does it enter an organization)?\n\n\nHow is the Skyflow platform architected?\n\nHow have the design and goals of the system changed or evolved over time?\n\n\nCan you describe the process of integrating with Skyflow at the application level?\nFor organizations that are building analytical capabilities on top of the data managed in their applications, what are the interactions with Skyflow at each of the stages in the data lifecycle?\nOne of the perennial problems with distributed systems is the challenge of joining data across machine boundaries. How do you mitigate that problem?\nOn your website there are different \"vaults\" advertised in the form of healthcare, fintech, and PII. What are the different requirements across each of those problem domains?\n\nWhat are the commonalities?\n\n\nAs a relatively new company in an emerging product category, what are some of the customer education challenges that you are facing?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Skyflow used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Skyflow?\nWhen is Skyflow the wrong choice?\nWhat do you have planned for the future of Skyflow?\n\nContact Info\n\nLinkedIn\n@seanfalconer on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nSkyflow\nPrivacy Engineering\nData Governance\nHomomorphic Encryption\nPolymorphic Encryption\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Sean Falconer about the value of introducing a data privacy vault as an architectural component of you applications and data systems to reduce the complexity involved with keeping your sensitive information secure and protected.","date_published":"2022-06-05T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/11eccbc2-b72e-4744-8682-74fcdaba3051.mp3","mime_type":"audio/mpeg","size_in_bytes":44886678,"duration_in_seconds":3244}]},{"id":"podlove-2022-06-06t01:52:04+00:00-837ebccfc95f42d","title":"Bringing The Modern Data Stack To Everyone With Y42","url":"https://www.dataengineeringpodcast.com/y42-full-stack-data-platform-episode-295","content_text":"Summary\nCloud services have made highly scalable and performant data platforms economical and manageable for data teams. However, they are still challenging to work with and manage for anyone who isn’t in a technical role. Hung Dang understood the need to make data more accessible to the entire organization and created Y42 as a better user experience on top of the \"modern data stack\". In this episode he shares how he designed the platform to support the full spectrum of technical expertise in an organization and the interesting engineering challenges involved.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! 
You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog\nYour host is Tobias Macey and today I’m interviewing Hung Dang about Y42, the full-stack data platform that anyone can run\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Y42 is and the story behind it?\nHow would you characterize your positioning in the data ecosystem?\nWhat are the problems that you are trying to solve?\n\nWho are the personas that you optimize for and how does that manifest in your product design and feature priorities?\n\n\nHow is the Y42 platform implemented?\n\nWhat are the core engineering problems that you have had to address in order to tie together the various underlying services that you integrate?\nHow have the design and goals of the product changed or evolved since you started working on it?\n\n\nWhat are the sharp edges and failure conditions that you have had to automate around in order to support non-technical users?\nWhat is the process for integrating Y42 with an organization’s data systems?\n\nWhat is the story for onboarding from existing systems and importing workflows (e.g. Airflow dags and dbt models)?\n\n\nWith your recent shift to using Git as the store of platform state, how do you approach the problem of reconciling branched changes with side effects from changes (e.g. creating tables or mutating table structures in the warehouse)?\nCan you describe a typical workflow for building or modifying a business dashboard or activating data in the warehouse?\nWhat are the interfaces and abstractions that you have built into the platform to support collaboration across roles and levels of experience? (technical or organizational)\nWith your focus on end-to-end support for data analysis, what are the extension points or escape hatches for use cases that you can’t support out of the box?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Y42 used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Y42?\nWhen is Y42 the wrong choice?\nWhat do you have planned for the future of Y42?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nY42\nCDTM (Center for Digital Technology and Management)\nMeltano\n\nPodcast Episode\n\n\nAirflow\nSinger\ndbt\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nGrouparoo\n\nPodcast Episode\n\n\nTerraform\nOpenTelemetry\n\nPodcast.__init__ Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:PostHog: ![Post Hog](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/K-hligJW.png)\r\n\r\nPostHog is an open source, product analytics platform. PostHog enables software teams to understand user behavior – auto-capturing events, performing product analytics and dashboarding, enabling video replays, and rolling out new features behind feature flags, all based on their single open source platform. The product’s open source approach enables companies to self-host, removing the need to send data externally. Try it out today at <u>[dataengineeringpodcast.com/posthog](https://www.dataengineeringpodcast.com/posthog)</u>","content_html":"

","summary":"An interview with Hung Dang about how the Y42 platform packages the modern data stack in a way that allows anyone to run it and gain value from their data","date_published":"2022-06-05T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4162e8ec-f8a1-43d4-8ea7-f168d120c61d.mp3","mime_type":"audio/mpeg","size_in_bytes":45799968,"duration_in_seconds":3541}]},{"id":"podlove-2022-05-30t01:23:27+00:00-dd4329a9d05344c","title":"Data Cloud Cost Optimization With Bluesky Data","url":"https://www.dataengineeringpodcast.com/bluesky-data-cloud-cost-optimization-episode-293","content_text":"Summary\nThe latest generation of data warehouse platforms have brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, they have also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. In order to ensure that you can explore and analyze your data without spending money on inefficient queries Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as identifying improvements that you can make in your queries to reduce their contribution to your bill.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! 
You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog\nYour host is Tobias Macey and today I’m interviewing Mingsheng Hong and Zheng Shao about Bluesky Data where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Bluesky is and the story behind it?\n\nWhat are the platforms/technologies that you are focused on in your current early stage?\nWhat are some of the other targets that you are considering once you validate your initial hypothesis?\n\n\nCloud cost optimization is an active area for application infrastructures as well. What are the corollaries and differences between compute and storage optimization strategies and what you are doing at Bluesky?\nHow have your experiences at hyperscale companies using various combinations of cloud and on-premise data platforms informed your approach to the cost management problem faced by adopters of cloud data systems?\nWhat are the most significant drivers of cost in cloud data systems?\n\nWhat are the factors (e.g. pricing models, organizational usage, inefficiencies) that lead to such inflated costs?\n\n\nWhat are the signals that you collect for identifying targets for optimization and tuning?\nCan you describe how the Bluesky mission control platform is architected?\n\nWhat are the current areas of uncertainty or active research that you are focused on?\n\n\nWhat is the workflow for a team or organization that is adding Bluesky to their system?\n\nHow does the usage of Bluesky change as teams move from the initial optimization and dramatic cost reduction into a steady state?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen teams approaching cost management in the absence of Bluesky?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Bluesky?\nWhen is Bluesky the wrong choice?\nWhat do you have planned for the future of Bluesky?\n\nContact Info\n\nMingsheng\n\nLinkedIn\n@mingshenghong on Twitter\n\n\nZheng\n\nLinkedIn\n@zshao9 on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nBluesky Data\n\nGet A Free Health Check For Your Snowflake From Bluesky\n\n\nRocksDB\nSnowflake\n\nPodcast Episode\n\n\nTrino\n\nPodcast Episode\n\n\nFirebolt\n\nPodcast Episode\n\n\nBigquery\nHive\nVertica\nMichael Stonebraker\nTeradata\nC-Store Paper\nOttertune\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\ninfracost\nSubtract: The Untapped Science of Less by Leidy Klotz\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with the founders of Bluesky Data about their approach to data cloud cost optimization for your Snowflake warehouse and the impact of usage based billing on organizational experimentation","date_published":"2022-05-29T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/00c22396-1a67-4816-b3fe-cb2d5d85329a.mp3","mime_type":"audio/mpeg","size_in_bytes":42548514,"duration_in_seconds":3804}]},{"id":"podlove-2022-05-30t01:30:29+00:00-65edc5aa33a7c86","title":"A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With Singlestore","url":"https://www.dataengineeringpodcast.com/a-multipurpose-database-for-transactions-and-analytics-to-simplify-your-data-architecture-with-singlestore","content_text":"Summary\nA large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. Singlestore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode SVP of engineering Shireesh Thota describes the impact on your overall system architecture that Singlestore can have and the benefits of using a cloud-native database engine for your next application.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. 
With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nData teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.\nYour host is Tobias Macey and today I’m interviewing Shireesh Thota about Singlestore (formerly MemSQL), the industry’s first modern relational database for multi-cloud, hybrid and on-premises workloads\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what SingleStore is and the story behind it?\nThe database market has gotten very crouded, with different areas of specialization and nuance being the differentiating factors. What are the core sets of workloads that SingleStore is aimed at addressing?\n\nWhat are some of the capabilities that it offers to reduce the need to incorporate multiple data stores for application and analytical architectures?\n\n\nWhat are some of the most valuable lessons that you learned in your time at MicroSoft that are applicable to SingleStore’s product focus and direction?\nNikita Shamgunov joined the show in October of 2018 to talk about what was then MemSQL. 
What are the notable changes in the engine and business that have occurred in the intervening time?\n\nWhat are the macroscopic trends in data management and application development that are having the most impact on product direction?\n\n\nFor engineering teams that are already invested in, or considering adoption of, the \"modern data stack\" paradigm, where does SingleStore fit in that architecture?\n\nWhat are the services or tools that might be replaced by an installation of SingleStore?\n\n\nWhat are the efficiencies or new capabilities that an engineering team might expect by adopting SingleStore?\nWhat are some of the features that are underappreciated/overlooked which you would like to call attention to?\nWhat are the most interesting, innovative, or unexpected ways that you have seen SingleStore used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SingleStore?\nWhen is SingleStore the wrong choice?\nWhat do you have planned for the future of SingleStore?\n\nContact Info\n\nLinkedIn\n@ShireeshThota on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMemSQL Interview With Nikita Shamgunov\nSinglestore\nMS SQL Server\nAzure Cosmos DB\nCitusDB\n\nPodcast Episode\n\n\nDebezium\n\nPodcast Episode\n\n\nPostgreSQL\n\nPodcast Episode\n\n\nMySQL\nHTAP == Hybrid Transactional-Analytical Processing\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Shireesh Thota about how the Singlestore database engine allows you to reduce architectural sprawl in your data systems by combining performant and scalable transactional and analytical capabilities into a single platform","date_published":"2022-05-29T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3a57f0ac-4355-46fe-a7ba-10a271c32327.mp3","mime_type":"audio/mpeg","size_in_bytes":33629276,"duration_in_seconds":2481}]},{"id":"podlove-2022-05-23t00:48:06+00:00-8b9f0cbec50d5d0","title":"Unlocking The Value Of Data Across The Organization Through User Friendly Data Tools With Prophecy","url":"https://www.dataengineeringpodcast.com/prophecy-low-code-user-experience-episode-292","content_text":"Summary\nThe interfaces and design cues that a tool offers can have a massive impact on who is able to use it and the tasks that they are able to perform. With an eye to making data workflows more accessible to everyone in an organization Raj Bains and his team at Prophecy designed a powerful and extensible low-code platform that lets technical and non-technical users scale data flows without forcing everyone into the same layers of abstraction. In this episode he explores the tension between code-first and no-code utilities and how he is working to balance the strengths without falling prey to their shortcomings.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. 
Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nYour host is Tobias Macey and today I’m interviewing Raj Bains about how improving the user experience for data tools can make your work as a data engineer better and easier\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the broad categories of data tool designs that are available currently and how does that impact what is possible with them?\n\nWhat are the points of friction that are introduced by the tools?\nCan you share some of the types of workarounds or wasted effort that are made necessary by those design elements?\n\n\nWhat are the core design principles that you have built into Prophecy to address these shortcomings?\n\nHow do those user experience changes improve the quality and speed of work for data engineers?\n\n\nHow has the Prophecy platform changed since we last spoke almost a year ago?\nWhat are the tradeoffs of low code systems for productivity vs. flexibility and creativity?\nWhat are the most interesting, innovative, or unexpected approaches to developer experience that you have seen for data tools?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on user experience optimization for data tooling at Prophecy?\nWhen is it more important to optimize for computational efficiency over developer productivity?\nWhat do you have planned for the future of Prophecy?\n\nContact Info\n\nLinkedIn\n@_raj_bains on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nProphecy\n\nPodcast Episode\n\n\nCUDA\nClustrix\nHortonworks\nApache Hive\nCompilerworks\n\nPodcast Episode\n\n\nAirflow\nDatabricks\nFivetran\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nStreamsets\nChange Data Capture\nApache Pig\nSpark\nScala\nAb Initio\nType 2 Slowly Changing Dimensions\nAWS Deequ\nMatillion\n\nPodcast Episode\n\n\nProphecy SaaS\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Raj Bains about how the Prophecy low-code platform for data engineering reduces the burden on everyone by making data workflows more intuitive and understandable.","date_published":"2022-05-22T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0f9a6427-9e76-43a8-a3d2-5f3a084ac65d.mp3","mime_type":"audio/mpeg","size_in_bytes":62522679,"duration_in_seconds":4256}]},{"id":"podlove-2022-05-23t00:33:22+00:00-10d870ab862bcf2","title":"Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte","url":"https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291","content_text":"Summary\nMachine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal needs for machine learning and integrated closely with Kubernetes as the execution manager. In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that has made Flyte successful.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nData lake architectures provide the best combination of massive scalability and cost reduction, but they aren’t always the most performant option. That’s why Kyligence has built on top of the leading open source OLAP engine for data lakes, Apache Kylin. 
With their AI augmented engine they detect patterns from your critical queries, automatically build data marts with optimized table structures, and provide a unified SQL interface across your lake, cubes, and indexes. Their cost-based query router will give you interactive speeds across petabyte scale data sets for BI dashboards and ad-hoc data exploration. Stop struggling to speed up your data lake. Get started with Kyligence today at dataengineeringpodcast.com/kyligence\nYour host is Tobias Macey and today I’m interviewing Ketan Umare and Haytham Abuelfutuh about Flyte, the open source and kubernetes-native orchestration engine for your data systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Flyte is and the story behind it?\nWhat was missing in the ecosystem of available tools that made it necessary/worthwhile to create Flyte?\nWorkflow orchestrators have been around for several years and have gone through a number of generational shifts. How would you characterize Flyte’s position in the ecosystem?\n\nWhat do you see as the closest alternatives?\nWhat are the core differentiators that might lead someone to choose Flyte over e.g. Airflow/Prefect/Dagster?\n\n\nWhat are the core primitives that Flyte exposes for building up complex workflows?\n\nMachine learning use cases have been a core focus since the project’s inception. What are some of the ways that that manifests in the design and feature set?\n\n\nCan you describe the architecture of Flyte?\n\nHow have the design and goals of the platform changed/evolved since you first started working on it?\n\n\nWhat are the changes in the data ecosystem that have had the most substantial impact on the Flyte project? (e.g. roadmap, integrations, pushing people toward adoption, etc.)\nWhat is the process for setting up a Flyte deployment?\nWhat are the user personas that you prioritize in the design and feature development for Flyte?\nWhat is the workflow for someone building a new pipeline in Flyte?\n\nWhat are the patterns that you and the community have established to encourage discovery and reuse of granular task definitions?\nBeyond code reuse, how can teams scale usage of Flyte at the company/organization level?\n\n\nWhat are the affordances that you have created to facilitate local development and testing of workflows while ensuring a smooth transition to production?\n\nWhat are the patterns that are available for CI/CD of workflows using Flyte?\n\n\nHow have you approached the design of data contracts/type definitions to provide a consistent/portable API for defining inter-task dependencies across languages?\nWhat are the available interfaces for extending Flyte and building integrations with other components across the data ecosystem?\nData orchestration engines are a natural point for generating and taking advantage of rich metadata. How do you manage creation and propagation of metadata within and across the framework boundaries?\nLast year you founded Union to offer a managed version of Flyte. 
What are the features that you are offering beyond what is available in the open source?\n\nWhat are the opportunities that you see for the Flyte ecosystem with a corporate entity to invest in expanding adoption?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Flyte used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Flyte?\nWhen is Flyte the wrong choice?\nWhat do you have planned for the future of Flyte?\n\nContact Info\n\nKetan Umare\nHaytham Abuelfutuh\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nFlyte\n\nSlack Channel\n\n\nUnion.ai\nKubeflow\nAirflow\nAWS Step Functions\nProtocol Buffers\nXGBoost\nMLFlow\nDagster\n\nPodcast Episode\n\n\nPrefect\n\nPodcast Episode\n\n\nArrow\nParquet\nMetaflow\nPytorch\n\nPodcast.__init__ Episode\n\n\ndbt\nFastAPI\n\nPodcast.__init__ Interview\n\n\nPython Type Annotations\nModin\n\nPodcast.__init__ Interview\n\n\nMonad\nDatahub\n\nPodcast Episode\n\n\nOpenMetadata\n\nPodcast Episode\n\n\nHudi\n\nPodcast Episode\n\n\nIceberg\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\nPandera\nUnion ML\nWeights and Biases\nWhylogs\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Kyligence: ![Kyligence](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/krLMJxWU.png)\r\n\r\nKyligence was founded in 2016 by the original creators of Apache Kylin™, the leading open source OLAP for Big Data. Kyligence offers an Intelligent OLAP Platform to simplify multidimensional analytics for cloud data lake. Its AI-augmented engine detects patterns from most frequently asked business queries, builds governed data marts automatically and brings metrics accountability on the data lake to optimize data pipeline and avoid excessive number of tables. It provides a unified SQL interface between the cloud object store, cubes, indexes and underlying data sources with a cost-based smart query router for business intelligence, ad-hoc analytics and data services at PB-scale.\r\n\r\nKyligence is trusted by global leaders in financial services, manufacturing and retail industries including UBS, China Construction Bank, China Merchants Bank, Pingan Bank, MetLife, Costa and Appzen. 
With technology partnership with Microsoft, Amazon, Tableau and Huawei, Kyligence is on a mission to simplify and govern data lakes to be productive for critical business analytics and data services.\r\n\r\nKyligence is dual headquartered in San Jose, CA, United States and Shanghai, China, and is backed by leading investors including Redpoint Ventures, Cisco, Broadband Capital, Shunwei Capital, Eight Roads Ventures, Coatue Management, SPDB International, CICC, Gopher Assets, Guofang Capital, ASG, Jumbo Sheen Fund, and Puxin Capital.\r\n\r\nGo to <u>[dataengineeringpodcast.com/kyligence](https://www.dataengineeringpodcast.com/kyligence)</u>today to find out more.","content_html":"

","summary":"An interview with Ketan Umare and Haytham Abuelfutuh about the Flyte open source and cloud native data orchestration engine for managing the machine learning lifecycle and their work at Union to make it more approachable","date_published":"2022-05-22T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/51b0f11f-f4a9-4221-87d9-25e94f765eb7.mp3","mime_type":"audio/mpeg","size_in_bytes":54620420,"duration_in_seconds":4027}]},{"id":"podlove-2022-05-16t01:36:38+00:00-2ce8d670c7916f5","title":"Designing And Deploying IoT Analytics For Industrial Applications At Vopak","url":"https://www.dataengineeringpodcast.com/vopak-industrial-iot-analytics-platform-episode-290","content_text":"Summary\nIndustrial applications are one of the primary adopters of Internet of Things (IoT) technologies, with business critical operations being informed by data collected across a fleet of sensors. Vopak is a business that manages storage and distribution of a variety of liquids that are critical to the modern world, and they have recently launched a new platform to gain more utility from their industrial sensors. In this episode Mário Pereira shares the system design that he and his team have developed for collecting and managing the collection and analysis of sensor data, and how they have split the data processing and business logic responsibilities between physical terminals and edge locations, and centralized storage and compute.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. 
With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nYour host is Tobias Macey and today I’m interviewing Mário Pereira about building a data management system for globally distributed IoT sensors at Vopak\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Vopak is and what kinds of information you rely on to power the business?\nWhat kinds of sensors and edge devices are you using?\n\nWhat kinds of consistency or variance do you have between sensors across your locations?\n\n\nHow much computing power and storage space do you place at the edge?\n\nWhat level of pre-processing/filtering is being done at the edge and how do you decide what information needs to be centralized?\nWhat are some examples of decision-making that happens at the edge?\n\n\nCan you describe the platform architecture that you have built for collecting and processing sensor data?\n\nWhat was your process for selecting and evaluating the various components?\n\n\nHow much tolerance do you have for missed messages/dropped data?\nHow long are your data retention periods and what are the factors that influence that policy?\nWhat are some statistics related to the volume, variety, and velocity of your data?\n\nWhat are the end-to-end latency requirements for different segments of your data?\n\n\nWhat kinds of analysis are you performing on the collected data?\nWhat are some of the potential ramifications of failures in your system? (e.g. spills, explosions, spoilage, contamination, revenue loss, etc.)\nWhat are some of the scaling issues that you have experienced as you brought your system online?\nHow have you been managing the decision making prior to implementing these technology solutions?\nWhat are the new capabilities and business processes that are enabled by this new platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen your data capabilities applied?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on building an IoT collection and aggregation platform at global scale?\nWhat do you have planned for the future of your IoT system?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nVopak\nSwinging Door Compression Algorithm\nIoT Greengrass\nOPCUA IoT protocol\nMongoDB\nAWS Kinesis\nAWS Batch\nAWS IoT Sitewise Edge\nBoston Dynamics\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Mário Pereira about how he and his team are building and scaling an industrial IoT analytics and reporting platform at Vopak, a global company responsible for storing and distributing the liquids that power modern society","date_published":"2022-05-15T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c3838080-de64-4b20-bf9f-399da25a741d.mp3","mime_type":"audio/mpeg","size_in_bytes":39516288,"duration_in_seconds":2874}]},{"id":"podlove-2022-05-16t01:31:21+00:00-36ccb203c46eb31","title":"Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way","url":"https://www.dataengineeringpodcast.com/data-lake-platform-design-srivatsan-sridharan-episode-289","content_text":"Summary\nDesigning a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! 
Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.\nYour host is Tobias Macey and today I’m interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what your experience has been with designing and implementing data platforms?\nWhat are the elements that you have found to be common requirements across organizations and data characteristics?\nWhat are the architectural elements that require the most detailed consideration based on organizational needs and data requirements?\nHow has the ecosystem for building maintainable and usable data lakes matured over the past few years?\n\nWhat are the elements that are still cumbersome or intractable?\n\n\nThe streaming ecosystem has also gone through substantial changes over the past few years. What is your synopsis of the meaningful differences between todays options and where we were ~6 years ago?\nHow did your experiences at Yelp inform your current architectural approach at Robinhood?\nCan you describe your current platform architecture?\n\nWhat are the primary capabilities that you are optimizing for?\n\n\nWhat is your evaluation process for determining what components to use in your platform?\n\nHow do you approach the build vs. buy problem and quantify the tradeoffs?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen your data systems used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing data platforms across your career?\nWhen is a data lake architecture the wrong choice?\nWhat do you have planned for the future of the data platform at Robinhood?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nRobinhood\nYelp\nKafka\nSpark\nFlink\n\nPodcast Episode\n\n\nPulsar\n\nPodcast Episode\n\n\nParquet\nChange Data Capture\nDelta Lake\n\nPodcast Episode\n\n\nHudi\n\nPodcast Episode\n\n\nRedshift\nBigQuery\nInformatica\nData Mesh\n\nPodcast Episode\n\n\nPrestoDB\nTrino\nAirbyte\n\nPodcast Episode\n\n\nMeltano\n\nPodcast Episode\n\n\nFivetran\n\nPodcast Episode\n\n\nStitch\nPinot\n\nPodcast Episode\n\n\nClickhouse\n\nPodcast Episode\n\n\nDruid\nIceberg\n\nPodcast Episode\n\n\nLooker\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Srivatsan Sridharan, head of data infrastructure at Robinhood, about his experiences building data lake platforms at Robinhood and Yelp, and how to account for organizational and technical needs while moving the business forward.","date_published":"2022-05-15T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3dbf61aa-cbb1-4249-af7b-a069f9001089.mp3","mime_type":"audio/mpeg","size_in_bytes":37275655,"duration_in_seconds":3490}]},{"id":"podlove-2022-05-09t01:50:55+00:00-65dd1c247965c46","title":"Exploring The Insights And Impact Of Dan Delorey's Distinguished Career In Data","url":"https://www.dataengineeringpodcast.com/dan-delorey-data-career-episode-288","content_text":"Summary\nDan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. 
Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nYour host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel\n\nInterview\n\n\nIntroduction\n\n\nHow did you get involved in the area of data management?\n\n\nCan you start by sharing what your current relationship to the data ecosystem is and the cliffs-notes version of how you ended up there?\n\n\nDremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?\n\n\nYou were instrumental in crafting the vision behind \"querying data in place,\" (what they called, federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach?\n\nHow well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper?\n\n\n\nFollowing your work on Drill you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. What do you see as the influence that those tools had on the evolution of the broader data ecosystem?\n\n\nHow have your experiences at Google influenced your approach to platform and organizational design at SoFi?\n\n\nWhat’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house?\n\n\nHow does your team solve for data quality and governance?\n\nWhat are the dominating factors that you consider when deciding on project/product priorities for your team?\n\n\n\nWhen you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there?\n\n\nYou also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space?\n\n\nWhat are the most interesting, innovative, or unexpected data systems that you have encountered?\n\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems?\n\n\nWhat are the areas that you are paying the most attention to?\n\n\nWhat interesting predictions do you have for the future of data systems and their applications?\n\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nSoFi\nBigquery\nDremel\nBrigham Young University\nEmpirical Software Engineering\nMap/Reduce\nHadoop\nSawzall\nVLDB Test Of Time Award Paper\nGFS\nColossus\nPartitioned Hash Join\nGoogle BigTable\nHBase\nAWS Athena\nSnowflake\n\nPodcast Episode\n\n\nData Vault\nStar Schema\nPrivacy Vault\nHomomorphic Encryption\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Dan Delorey about his career, from helping to build and scale Dremel at Google, to managing BigQuery in its early days, to his current role as the VP of Data at SofI with many interesting insights along the way","date_published":"2022-05-08T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1bf11d4c-530a-4d9b-b9c0-80f23b57bb62.mp3","mime_type":"audio/mpeg","size_in_bytes":40339745,"duration_in_seconds":3651}]},{"id":"podlove-2022-05-09t01:35:45+00:00-f044432306ee701","title":"Scaling Analysis of Connected Data And Modeling Complex Relationships With The TigerGraph Graph Database","url":"https://www.dataengineeringpodcast.com/tigergraph-graph-database-episode-287","content_text":"Summary\nMany of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs to provide efficient and accurate analysis of their relationships. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. 
Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.\nYour host is Tobias Macey and today I’m interviewing Jon Herke about TigerGraph, a distributed native graph database\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what TigerGraph is and the story behind it?\nWhat are some of the core use cases that you are focused on supporting?\nHow has TigerGraph changed over the past 4 years since I spoke with Todd Blaschka at the Open Data Science Conference?\nHow has the ecosystem of graph databases changed in usage and design in recent years?\nWhat are some of the persistent areas of confusion or misinformation that you encounter when explaining graph databases and TigerGraph to potential users?\nThe tagline on your website says that TigerGraph is \"The Only Scalable Graph Database for the Enterprise\". Can you unpack that claim and explain what is necessary for a graph database to be suitable for enterprise use?\nWhat are some of the typical application and system architectures that you typically see for end-users of TigerGraph? (e.g. polyglot persistence, etc.)\nWhat are the cases where TigerGraph should be the system of record as opposed to an optimization option for addressing highly connected data?\nWhat are the data modeling considerations that end-users should be thinking of when planning their storage structures in TigerGraph?\nWhat are the most interesting, innovative, or unexpected ways that you have seen TigerGraph used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on TigerGraph?\nWhen is TigerGraph the wrong choice?\nWhat do you have planned for the future of TigerGraph?\n\nContact Info\n\nLinkedIn\n@jonherke on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nTigerGraph\nGraphQL\nKafka\nGQL (Graph Query Language)\nLDBC (Linked Data Benchmark Council)\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Many of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs to provide efficient and accurate analysis of their relationships. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Jon Herke about the capabilities that TigerGraph provides for modeling and analyzing highly connected datasets in a fully managed native graph database engine.","date_published":"2022-05-08T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/aedd9ea1-85ba-454e-94ae-e27072ecb628.mp3","mime_type":"audio/mpeg","size_in_bytes":31770025,"duration_in_seconds":2395}]},{"id":"podlove-2022-05-02t00:59:28+00:00-e5bbdebb481efe0","title":"Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion","url":"https://www.dataengineeringpodcast.com/matillion-cloud-data-integration-episode-286","content_text":"Summary\nThe predominant pattern for data integration in the cloud has become extract, load, and then transform or ELT. Matillion was an early innovator of that approach and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration workflows.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.\nYour host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Matillion is and the story behind it?\nWhat are the use cases and user personas that you are focused on supporting?\n\nHow does that influence the focus and pace of your feature development and priorities?\n\n\nHow is Matillion architected?\n\nHow have the design and goals of the system changed since you started working on it?\n\n\nThe ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same?\nWhat have been the most challenging integrations to build and support?\nWhat is a typical workflow for integrating Matillion into an organization and building a set of pipelines?\n\nWhat are some of the patterns that have been useful for managing incidental complexity as usage scales?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Matillion used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion?\nWhen is Matillion the wrong choice?\nWhat do you have planned for the future of Matillion?\n\nContact Info\n\nLinkedIn\nMatillion Contact\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMatillion\n\nTwitter\n\n\nIBM DB2\nCognos\nTalend\nRedshift\nAWS Marketplace\nAWS Re:Invent\nAzure\nGCP == Google Cloud Platform\nInformatica\nSSIS == SQL Server Integration Services\nPCRE == Perl Compatible Regular Expressions\nTeradata\nTomcat\nCollibra\nAlation\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The predominant pattern for data integration in the cloud has become extract, load, and then transform or ELT. Matillion was an early innovator of that approach and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration workflows.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Ed Thompson, CTO of Matillion, about building a sustainable business and powering low maintenance data integration pipelines in the cloud","date_published":"2022-05-01T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/59fba319-ff1a-4bf6-954c-be61b478a9e9.mp3","mime_type":"audio/mpeg","size_in_bytes":37249364,"duration_in_seconds":3199}]},{"id":"podlove-2022-05-02t00:09:17+00:00-6b4f04f9a1c5013","title":"Evolving And Scaling The Data Platform at Yotpo","url":"https://www.dataengineeringpodcast.com/yotpo-data-platform-architecture-episode-285","content_text":"Summary\nBuilding a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. 
PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog\nYour host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Yotpo is and the role that data plays in the organization?\nWhat are the core data types and sources that you are working with?\n\nWhat kinds of data assets are being produced and how do those get consumed and re-integrated into the business?\n\n\nWhat are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with?\n\nWhat is the size of your team and how is it structured?\n\n\nYou recently posted about the current architecture of your data platform. What was the starting point on your platform journey?\n\nWhat did the early stages of feature and platform evolution look like?\nWhat was the catalyst for making a concerted effort to integrate your systems into a cohesive platform?\n\n\nWhat was the scope and directive of the project for building a platform?\n\nWhat are the metrics and capabilities that you are optimizing for in the structure of your data platform?\nWhat are the organizational or regulatory constraints that you needed to account for?\n\n\nWhat are some of the early decisions that affected your available choices in later stages of the project?\nWhat does the current state of your architecture look like?\n\nHow long did it take to get to where you are today?\n\n\nWhat were the factors that you considered in the various build vs. buy decisions?\n\nHow did you manage cost modeling to understand the true savings on either side of that decision?\n\n\nIf you were to start from scratch on a new data platform today what might you do differently?\nWhat are the decisions that proved helpful in the later stages of your platform development?\nWhat are the most interesting, innovative, or unexpected ways that you have seen your platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform?\nWhat do you have planned for the future of your platform infrastructure?\n\nContact Info\n\nDoron\n\nLinkedIn\n\n\nLiran\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nYotpo\n\nData Platform Architecture Blog Post\n\n\nGreenplum\nDatabricks\nMetorikku\nApache Hive\nCDC == Change Data Capture\nDebezium\n\nPodcast Episode\n\n\nApache Hudi\n\nPodcast Episode\n\n\nUpsolver\n\nPodcast Episode\n\n\nSpark\nPrestoDB\nSnowflake\n\nPodcast Episode\n\n\nDruid\nRockset\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nAcryl\n\nPodcast Episode\n\n\nAtlan\n\nPodcast Episode\n\n\nOpenLineage\n\nPodcast Episode\n\n\nOkera\nShopify Data Warehouse Episode\nRedshift\nDelta Lake\n\nPodcast Episode\n\n\nIceberg\n\nPodcast Episode\n\n\nOutbox Pattern\nBackstage\nRoadie\nNomad\nKubernetes\nDeequ\nGreat Expectations\n\nPodcast Episode\n\n\nLakeFS\n\nPodcast Episode\n\n\n2021 Recap Episode\nMonte Carlo\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Doron Porat and Liran Yogev about the organizational constraints and technological requirements that shaped the current architecture of the data platform at Yotpo","date_published":"2022-05-01T20:15:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c16d1ca8-5f45-48fd-90c8-2e71e535748b.mp3","mime_type":"audio/mpeg","size_in_bytes":49566867,"duration_in_seconds":3850}]},{"id":"podlove-2022-04-24t22:17:25+00:00-d6d8c5d1bddf519","title":"Operational Analytics At Speed With Minimal Busy Work Using Incorta","url":"https://www.dataengineeringpodcast.com/incorta-fast-operational-analytics-episode-284","content_text":"Summary\nA huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.\nYour host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Incorta is and the story behind it?\nWhat are the use cases and customers that you are focused on?\n\nHow does that focus inform the design and priorities of functionality in the product?\n\n\nWhat are the technologies and workflows that Incorta might replace?\n\nWhat are the systems and services that it is intended to integrate with and extend?\n\n\nCan you describe how Incorta is implemented?\n\nWhat are the core technological decisions that were necessary to make the product successful?\nHow have the design and goals of the system changed and evolved since you started working on it?\n\n\nCan you describe the workflow for building an end-to-end analysis using Incorta?\n\nWhat are some of the new capabilities or use cases that Incorta enables which are impractical or intractable with other combinations of tools in the ecosystem?\n\n\nHow do the features of Incorta influence the approach that teams take for data modeling?\nWhat are the points of collaboration and overlap between organizational roles while using Incorta?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Incorta used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Incorta?\nWhen is Incorta the wrong choice?\nWhat do you have planned for the future of Incorta?\n\nContact Info\n\nLinkedIn\n@layereddelay on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nIncorta\n3rd Normal Form\nParquet\n\nPodcast Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nIceberg\n\nPodcast Episode\n\n\nPrestoDB\nPySpark\nDataiku\nAngular\nReact\nApache ECharts\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Matthew Halliday about how Incorta uses in-memory processing to power real-time operational analytics and reduce the amount of work involved in gaining insights from your data","date_published":"2022-04-24T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0ec0a4b4-f188-498f-b2df-1ad7893ed0b3.mp3","mime_type":"audio/mpeg","size_in_bytes":63475080,"duration_in_seconds":4276}]},{"id":"podlove-2022-04-23t22:14:15+00:00-7259545e133bc0b","title":"Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs","url":"https://www.dataengineeringpodcast.com/whylogs-data-logging-data-observability-episode-283","content_text":"Summary\nThere are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it functions to provide detailed context for being able to gain insight into all of your data processes.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! 
You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog\nYour host is Tobias Macey and today I’m interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Whylabs is and the story behind it?\nHow is \"data logging\" differentiated from logging for the purpose of debugging and observability of software logic?\nWhat are the use cases that you are aiming to support with Whylogs?\n\nHow does it compare to libraries and services like Great Expectations/Monte Carlo/Soda Data/Datafold etc.\n\n\nCan you describe how Whylogs is implemented?\n\nHow have the design and goals of the project changed or evolved since you started working on it?\n\n\nHow do you maintain feature parity between the Python and Java integrations?\nHow do you structure the log events and metadata to provide detail and context for data applications?\n\nHow does that structure support aggregation and interpretation/analysis of the log information?\n\n\nWhat is the process for integrating Whylogs into an existing project?\n\nOnce you have the code instrumented with log events, what is the workflow for using Whylogs to debug and maintain a data application?\n\n\nWhat have you found to be useful heuristics for identifying what to log?\nWhat are some of the strategies that teams can use to maintain a balance of signal vs. noise in the events that they are logging?\nHow is the Whylogs governance set up and how are you approaching sustainability of the open source project?\nWhat are the additional utilities and services that you anticipate layering on top of/integrating with Whylogs?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Whylogs used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Whylabs?\nWhen is Whylogs/Whylabs the wrong choice?\nWhat do you have planned for the future of Whylabs?\n\nContact Info\n\nLinkedIn\n@andy_dng on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nWhylogs\nWhylabs\nSpark\nAirflow\nPandas\n\nPodcast Episode\n\n\nData Sketches\nGrafana\nGreat Expectations\n\nPodcast Episode\n\n\nMonte Carlo\n\nPodcast Episode\n\n\nSoda Data\n\nPodcast Episode\n\n\nDatafold\n\nPodcast Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nHyperLogLog\nMLFlow\nFlyte\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it functions to provide detailed context for being able to gain insight into all of your data processes.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Andy Dang about the open source WhyLogs library and how it simplifies the work of data logging for instrumenting your machine learning workflows and unlocking observability.","date_published":"2022-04-24T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/87a4a065-a1bc-423c-910d-622bdcec5f4a.mp3","mime_type":"audio/mpeg","size_in_bytes":46035833,"duration_in_seconds":3543}]},{"id":"podlove-2022-04-16t21:37:18+00:00-c0c0b7c6c326e2a","title":"Connecting To The Next Frontier Of Computing With Quantum Networks","url":"https://www.dataengineeringpodcast.com/aliro-quantum-networking-episode-282","content_text":"Summary\nThe next paradigm shift in computing is coming in the form of quantum technologies. Quantum procesors have gained significant attention for their speed and computational power. The next frontier is in quantum networking for highly secure communications and the ability to distribute across quantum processing units without costly translation between quantum and classical systems. In this episode Prineha Narang, co-founder and CTO of Aliro, explains how these systems work, the capabilities that they can offer, and how you can start preparing for a post-quantum future for your data systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nYour host is Tobias Macey and today I’m interviewing Dr. Prineha Narang about her work at Aliro building quantum networking technologies and how it impacts the capabilities of data systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Aliro is and the story behind it?\nWhat are the use cases that you are focused on?\nWhat is the impact of quantum networks on distributed systems design? (what limitations does it remove?)\nWhat are the failure modes of quantum networks?\n\nHow do they differ from classical networks?\n\n\nHow can network technologies bridge between classical and quantum connections and where do those transitions happen?\n\nWhat are the latency/bandwidth capacities of quantum networks?\nHow does it influence the network protocols used during those communications?\n\nHow much error correction is necessary during the quantum communication stages of network transfers?\n\n\n\n\nHow does quantum computing technology change the landscape for AI technologies?\n\nHow does that impact the work of data engineers who are building the systems that power the data feeds for those models?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen quantum technologies used for data systems?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Aliro and your academic research?\nWhen are quantum technologies the wrong choice?\nWhat do you have planned for the future of Aliro and your research efforts?\n\nContact Info\n\nLinkedIn\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAliro Quantum\nHarvard University\nCalTech\nQuantum Computing\nQuantum Repeater\nARPANet\nTrapped Ion Quantum Computer\nPhotonic Computing\nSDN == Software Defined Networking\nQPU == Quantum Processing Unit\nIEEE\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

The next paradigm shift in computing is coming in the form of quantum technologies. Quantum processors have gained significant attention for their speed and computational power. The next frontier is in quantum networking for highly secure communications and the ability to distribute across quantum processing units without costly translation between quantum and classical systems. In this episode Prineha Narang, co-founder and CTO of Aliro, explains how these systems work, the capabilities that they can offer, and how you can start preparing for a post-quantum future for your data systems.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Prineha Narang about what quantum networks are, why they are an important development in the future of the internet, and how you can start preparing to take advantage of them.","date_published":"2022-04-17T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6775a3f4-4eea-4f94-b612-0506ea8365b0.mp3","mime_type":"audio/mpeg","size_in_bytes":28655189,"duration_in_seconds":2423}]},{"id":"podlove-2022-04-16t18:58:36+00:00-c680e72764b662b","title":"What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?","url":"https://www.dataengineeringpodcast.com/mlops-for-data-engineers-episode-281","content_text":"Summary\nPutting machine learning models into production and keeping them there requires investing in well-managed systems to manage the full lifecycle of data cleaning, training, deployment and monitoring. This requires a repeatable and evolvable set of processes to keep it functional. The term MLOps has been coined to encapsulate all of these principles and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nYour host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what MLOps is?\n\nHow does it relate to DataOps? DevOps? (is it just another buzzword?)\n\n\nWhat is your interest and involvement in the space of MLOps?\nWhat are the open and active questions in the MLOps community?\nWho is responsible for MLOps in an organization?\n\nWhat is the role of the data engineer in that process?\n\n\nWhat are the core capabilities that are necessary to support an \"MLOps\" workflow?\nHow do the current platform technologies support the adoption of MLOps workflows?\n\nWhat are the areas that are currently underdeveloped/underserved?\n\n\nCan you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices?\nWhat are some of the common requirements for supporting ML workflows?\n\nWhat are some of the ways that requirements become bespoke to a given organization or project?\n\n\nWhat are the opportunities for standardization or consolidation in the tooling for MLOps?\n\nWhat are the pieces that are always going to require custom engineering?\n\n\nWhat are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on supporting the MLOps community?\nWhat are your predictions for the future of MLOps?\n\nWhat are you keeping a close eye on?\n\n\n\nContact Info\n\nDemetrios\n\nLinkedIn\n@Dpbrinkm on Twitter\nMedium\n\n\nDavid\n\nLinkedIn\n@aponteanalytics on Twitter\naponte411 on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nMLOps Community\nEverybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz (affiliate link)\nMLOps\nDataOps\nDevOps\nThe Sequence Newsletter\nNeptune.ai\nAlgorithmia\nKubeflow\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Putting machine learning models into production and keeping them there requires investing in well-managed systems to manage the full lifecycle of data cleaning, training, deployment and monitoring. This requires a repeatable and evolvable set of processes to keep it functional. The term MLOps has been coined to encapsulate all of these principles and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Demetrios Brinkmann and David Aponte about the role of MLOps principles when building machine learning systems and how the data engineer can help make it sustainable.","date_published":"2022-04-16T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/62d3dd72-05c3-4fd9-9e52-fbc757d975d5.mp3","mime_type":"audio/mpeg","size_in_bytes":51801773,"duration_in_seconds":4553}]},{"id":"podlove-2022-04-10t00:56:06+00:00-db99671cdf9ce99","title":"DataOps As A Service For Your Data Integration Workflows With Rivery","url":"https://www.dataengineeringpodcast.com/rivery-dataops-as-a-service-episode-290","content_text":"Summary\nData engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done which can introduce significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. 
No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nAre you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!\nYour host is Tobias Macey and today I’m interviewing Itamar Ben Hemo about Rivery, a SaaS platform designed to provide an end-to-end solution for Ingestion, Transformation, Orchestration, and Data Operations\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Rivery is and the story behind it?\nWhat are the primary goals of Rivery as a platform and company?\nWhat are the target personas for the Rivery platform?\n\nWhat are the points of interaction/workflows for each of those personas?\nWhat are some of the positive and negative sources of inspiration that you looked to while deciding on the scope of the platform?\n\n\nThe majority of recently formed companies are focused on narrow and composable concerns of data management. What do you see as the shortcomings of that approach?\n\nWhat are some of the tradeoffs between integrating independent tools vs buying into an ecosystem?\n\n\nHow is the Rivery platform designed and implemented?\n\nHow have the design and goals of the platform changed or evolved since you began working on it?\nWhat were your criteria for the MVP that would allow you to test your hypothesis?\n\n\nHow has the evolution of the ecosystem influenced your product strategy?\nOne of the interesting features that you offer is the catalog of \"kits\" to quickly set up common workflows. How do you manage regression/integration testing for those kits as the Rivery platform evolves?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Rivery used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Rivery?\nWhen is Rivery the wrong choice?\nWhat do you have planned for the future of Rivery?\n\nContact Info\n\nLinkedIn\n@ItamarBenHemo on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nRivery\nMatillion\nBigQuery\nSnowflake\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nFivetran\n\nPodcast Episode\n\n\nSnowpark\nPostman\nDebezium\n\nPodcast Episode\n\n\nSnowflake Partner Connect\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done, which can introduce significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben Hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Itamar Ben Hemo about the Rivery platform for automated dataops for your ELT workflows and how it is architected to accelerate your time to value.","date_published":"2022-04-10T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/637bcb41-3b47-4174-966b-4880a3a41e28.mp3","mime_type":"audio/mpeg","size_in_bytes":49380234,"duration_in_seconds":3484}]},{"id":"podlove-2022-04-10t00:52:07+00:00-925148579471b7d","title":"Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel","url":"https://www.dataengineeringpodcast.com/gretel-privacy-engineering-episode-279","content_text":"Summary\nAny time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nAre you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!\nRudderStack helps you build a customer data platform on your warehouse or data lake. 
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nYour host is Tobias Macey and today I’m interviewing John Myers about privacy engineering and use cases for synthetic data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Gretel is and the story behind it?\nHow do you define \"privacy engineering\"?\n\nIn an organization or data team, who is typically responsible for privacy engineering?\n\n\nHow would you characterize the current state of the art and adoption for privacy engineering?\nWho are the target users of Gretel and how does that inform the features and design of the product?\nWhat are the stages of the data lifecycle where Gretel is used?\nCan you describe a typical workflow for integrating Gretel into data pipelines for business analytics or ML model training?\nHow is the Gretel platform implemented?\n\nHow have the design and goals of the system changed or evolved since you started working on it?\n\n\nWhat are some of the nuances of synthetic data generation or masking that data engineers/data analysts need to be aware of as they start using Gretel?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Gretel used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Gretel?\nWhen is Gretel the wrong choice?\nWhat do you have planned for the future of Gretel?\n\nContact Info\n\nLinkedIn\n@jtm_tech on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nGretel\nPrivacy Engineering\nWeights and Biases\nRed Team/Blue Team\nGenerative Adversarial Network\nCapture The Flag in application security\nCVE == Common Vulnerabilities and Exposures\nMachine Learning Cold Start Problem\nFaker\nMockaroo\nKaggle\nSentry\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Any time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with John Myers, CTO of Gretel Labs, about the complexities of privacy engineering and how they are reducing the friction involved in securing your datasets for safer sharing.","date_published":"2022-04-10T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c273d442-15f1-40f7-a299-46548cd9e90d.mp3","mime_type":"audio/mpeg","size_in_bytes":37821869,"duration_in_seconds":2912}]},{"id":"podlove-2022-04-03t22:15:07+00:00-7429e9c693e1438","title":"Accelerate Development Of Enterprise Analytics With The Coalesce Visual Workflow Builder","url":"https://www.dataengineeringpodcast.com/coalesce-enterprise-analytics-transformations-episode-278","content_text":"Summary\nThe flexibility of software oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nAre you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!\nYour host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Coalesce and the story behind it?\nWhat are the core problems that you are focused on solving with Coalesce?\nThe platform appears to be fairly opinionated in the workflow. What are the design principles and philosophies that you have embedded into the user experience?\nCan you describe how Coalesce is implemented?\nWhat are the pitfalls in data architecture patterns that you commonly see organizations fall prey to?\n\nHow do the pre-built transformation templates in Coalesce help to guide users in a more maintainable direction?\n\n\nThe platform is currently tied to Snowflake as the underlying engine. How much effort will it be to expand your integrations and the scope of Coalesece?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Coalesce used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Coalesce?\nWhen is Coalesce the wrong choice?\nWhat do you have planned for the future of Coalesce?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nCoalesce\nData Warehouse Toolkit\nWherescape\ndbt\n\nPodcast Episode\n\n\nType 2 Dimensions\nFirebase\nKubernetes\nStar Schema\nData Vault\n\nPodcast Episode\n\n\nData Mesh\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

The flexibility of software oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Satish Jayanthi about how the Coalesce platform powers enterprise analytics and accelerates their time to insight for workflows in the data warehouse.","date_published":"2022-04-03T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/04e07bdd-6d5b-4c4f-bb3f-622229d65977.mp3","mime_type":"audio/mpeg","size_in_bytes":30761320,"duration_in_seconds":2565}]},{"id":"podlove-2022-04-03t21:46:57+00:00-b467da21ccb6fad","title":"Repeatable Patterns For Designing Data Platforms And When To Customize Them","url":"https://www.dataengineeringpodcast.com/red-ventures-data-platform-design-episode-277","content_text":"Summary\nBuilding a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nHey Data Engineering Podcast listeners, want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join our live webinar on April 20th. Joybird director of analytics, Brett Trani, will walk through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Visit www.rudderstack.com/joybird?utm_source=rss&utm_medium=rss to register today.\nThe most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. 
Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog\nYour host is Tobias Macey and today I’m interviewing Brandon Beidel about his data platform journey at Red Ventures\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Red Ventures is and your role there?\n\nGiven the relative newness of data product management, where do you draw inspiration and direction for how to approach your work?\n\n\nWhat are the primary categories of data product that your data consumers are building/relying on?\nWhat are the types of data sources that you are working with to power those downstream use cases?\nCan you describe the size and composition/organization of your data team(s)?\nHow do you approach the build vs. buy decision while designing and evolving your data platform?\nWhat are the tools/platforms/architectural and usage patterns that you and your team have developed for your platform?\n\nWhat are the primary goals and constraints that have contributed to your decisions?\nHow have the goals and design of the platform changed or evolved since you started working with the team?\n\n\nYou recently went through the process of establishing and reporting on SLAs for your data products. Can you describe the approach you took and the useful lessons that were learned?\nWhat are the technical and organizational components of the data work at Red Ventures that have proven most difficult?\nWhat excites you most about the future of data engineering?\nWhat are the most interesting, innovative, or unexpected ways that you have seen teams building more reliable data systems?\nWhat aspects of data tooling or processes are still missing for most data teams?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data products at Red Ventures?\nWhat do you have planned for the future of your data platform?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nRed Ventures\nMonte Carlo\nOpportunity Cost\ndbt\n\nPodcast Episode\n\n\nApache Ranger\nPrivacera\n\nPodcast Episode\n\n\nSegment\nFivetran\n\nPodcast Episode\n\n\nDatabricks\nBigquery\nRedshift\nHightouch\n\nPodcast Episode\n\n\nAirflow\nAstronomer\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nClickhouse\n\nPodcast Episode\n\n\nPresto\n\nPodcast Episode\n\n\nTrino\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Building a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Brandon Beidel about his experiences at Red Ventures designing and supporting analytical data platforms for his customers and how he and his team have established a set of useful patterns to make it scalable.","date_published":"2022-04-03T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/41ec8a6a-8f99-4774-a39e-36b6a5ee8bca.mp3","mime_type":"audio/mpeg","size_in_bytes":35913090,"duration_in_seconds":2822}]},{"id":"podlove-2022-03-27t17:48:36+00:00-3dd24aa16cdb5ae","title":"Eliminate The Bottlenecks In Your Key/Value Storage With SpeeDB","url":"https://www.dataengineeringpodcast.com/speedb-fast-key-value-store-episode-276","content_text":"Summary\nAt the foundational layer many databases and data processing engines rely on key/value storage for managing the layout of information on the disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs the RocksDB engine can become a bottleneck for performance. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times faster throughput, how the lower latencies work to reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. 
Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nTimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale\nYour host is Tobias Macey and today I’m interviewing Adi Gelvan about his work on SpeeDB, the \"next generation data engine\"\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what SpeeDB is and the story behind it?\nWhat is your target market and customer?\n\nWhat are some of the shortcomings of RocksDB that these organizations are running into and how do they manifest?\n\n\nWhat are the characteristics of RocksDB that have led so many database engines to embed it or build on top of it?\n\nWhich of the systems that rely on RocksDB do you most commonly see running into its limitations?\n\n\nHow does the work you have done at SpeeDB compare to the efforts of the Terark project?\nCan you describe how you approached the work of identifying areas for improvement in RocksDB?\n\nWhat are some of the optimizations that you introduced?\nWhat are some tradeoffs that you deemed acceptable in the process of optimizing for speed and scale?\n\n\nWhat is the integration process for adopting SpeeDB?\n\nIn the event that an organization has a system with data resident in RocksDB, what is the migration process?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen SpeeDB used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SpeeDB?\nWhen is SpeeDB the wrong choice?\nWhat do you have planned for the future of SpeeDB?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nSpeeDB\nRocksDB\nTerarkDB\nEMC\nInfinidat\nLSM == Log-Structured Merge Tree\nB+ Tree\nLevelDB\nLMDB\nBloom Filter\nBadger\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

At the foundational layer many databases and data processing engines rely on key/value storage for managing the layout of information on the disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs the RocksDB engine can become a bottleneck for performance. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times faster throughput, how the lower latencies work to reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Adi Gelvan about how he and his team re-engineered the RocksDB key/value storage engine for accelerated performance on high volume and high throughput workloads.","date_published":"2022-03-27T14:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/29ec3605-ad13-42ea-804f-e81e3cd9f1d6.mp3","mime_type":"audio/mpeg","size_in_bytes":36359774,"duration_in_seconds":2812}]},{"id":"podlove-2022-03-27t17:09:31+00:00-401107dc6fbb4b4","title":"Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera","url":"https://www.dataengineeringpodcast.com/privacera-enterprise-cloud-data-governance-episode-275","content_text":"Summary\nData governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise grade solution for cloud and hybrid data governance built on top of the robust and battle tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nThe most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. 
PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog\nYour host is Tobias Macey and today I’m interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Privacera is and the story behind it?\nWhat is your working definition of \"data governance\" and how does that influence your product focus and priorities?\nWhat are some of the lessons that you learned from your work on Apache Ranger that helped with your efforts at Privacera?\nHow would you characterize your position in the market for data governance/data security tools?\nWhat are the unique constraints and challenges that come into play when managing data in cloud platforms?\nCan you explain how the Privacera platform is architected?\n\nHow have the design and goals of the system changed or evolved since you started working on it?\n\n\nWhat is the workflow for an operator integrating Privacera into a data platform?\n\nHow do you provide feedback to users about the level of coverage for discovered data assets?\n\n\nHow does Privacera fit into the workflow of the different personas working with data?\n\nWhat are some of the security and privacy controls that Privacera introduces?\n\n\nHow do you mitigate the potential for anyone to bypass Privacera’s controls by interacting directly with the underlying systems?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Privacera used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacera?\nWhen is Privacera the wrong choice?\nWhat do you have planned for the future of Privacera?\n\nContact Info\n\nLinkedIn\n@Balaji_Blog on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nPrivacera\nHadoop\nHortonworks\nApache Ranger\nOracle\nTeradata\nPresto/Trino\nStarburst\n\nPodcast Episode\n\n\nAhana\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSponsored By:Acryl: ![Acryl](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/2E3zCRd4.png)\r\n\r\nThe modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. 
Founded by the leaders who created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at <u>[dataengineeringpodcast.com/acryl](https://www.dataengineeringpodcast.com/acryl)</u>","content_html":"

Summary

\n

Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise grade solution for cloud and hybrid data governance built on top of the robust and battle tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

Sponsored By:

","summary":"An interview with Balaji Ganesan about how Privacera levels up the open source Apache Ranger project to bridge data governance from on premise datacenters to the cloud without compromise.","date_published":"2022-03-27T13:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f5dc65ad-1880-407a-8d62-27956e340cff.mp3","mime_type":"audio/mpeg","size_in_bytes":42311523,"duration_in_seconds":3755}]},{"id":"podlove-2022-03-20t17:34:26+00:00-4eac3e2b2bd8e9e","title":"Exploring Incident Management Strategies For Data Teams","url":"https://www.dataengineeringpodcast.com/data-incident-management-strategies-episode-274","content_text":"Summary\nData assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nAre you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. 
The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!\nYour host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing some of the ways that an \"incident\" can manifest in a data system?\n\nAt a high level, what are the steps and participants required to bring an incident to resolution?\n\n\nThe principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams?\nWhat are the signals that teams should be monitoring to identify and alert on potential incidents?\n\nAlerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue and provide useful context in the alerts that do get sent?\n\nAnother aspect of this problem is the proper routing of alerts to ensure that the right person sees and acts on it. How have you seen teams deal with the challenge of delivering alerts to the right people?\n\n\n\n\nWhen there is an active incident, what are the steps that you commonly see data teams take to understand the cause and scope of the issue?\nHow can teams augment their systems to make incidents faster to resolve?\nWhat are the most interesting, innovative, or unexpected ways that you have seen teams approch incident response?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on incident management strategies?\nWhat are the aspects of incident management for data teams that are still missing?\n\nContact Info\n\nMei\n\n@tao_mei on Twitter\nEmail\n\n\nFrancisco\n\n@falberini on Twitter\nEmail\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMonte Carlo\n\nLearn more about RCA best practices\n\n\nSegment\n\nPodcast Episode\n\n\nSegment Protocols\nRedshift\nAirflow\ndbt\n\nPodcast Episode\n\n\nThe Goal by Eliahu Golratt\nData Mesh\n\nPodcast Episode\nFollow-Up Podcast Episode\n\n\nPagerDuty\nOpsGenie\nGrafana\nPrometheus\nSentry\n\nPodcast.__init__ Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Mei Tao and Francisco Alberini about their experiences working with data teams as they adopt data observability and incident management strategies and how to start introducing those practices into your own work.","date_published":"2022-03-20T16:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/399e0f7f-013f-49fa-a613-c629b0a977d4.mp3","mime_type":"audio/mpeg","size_in_bytes":48255315,"duration_in_seconds":3445}]},{"id":"podlove-2022-03-20t17:02:18+00:00-9904e49505675d2","title":"Accelerate Your Embedded Analytics With Apache Pinot","url":"https://www.dataengineeringpodcast.com/pinot-embedded-analytics-episode-273","content_text":"Summary\nData and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimizing for low latency queries on rapidly updating datasets with highly concurrent queries. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high throughput data workloads today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nThis episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. 
Signup for the SaaS product today at dataengineeringpodcast.com/acryl\nYour host is Tobias Macey and today I’m interviewing Kishore Gopalakrishna and Xiang Fu about Apache Pinot and its applications for powering user-facing analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Pinot is and the story behind it?\nWhat are the primary use cases that Pinot is designed to support?\nThere are numerous OLAP engines available with varying tradeoffs and optimal use cases. What are the cases where Pinot is the preferred choice?\n\nHow does it compare to systems such as Clickhouse (for OLAP) or CubeJS/GoodData (for embedded analytics)?\n\n\nHow do the operational needs of a database engine change as you move from serving internal stakeholders to external end-users?\nCan you describe how Pinot is architected?\n\nWhat were the key design elements that were necessary to support low-latency queries with high concurrency?\n\n\nCan you describe a typical end-to-end architecture where Pinot will be used for embedded analytics?\n\nWhat are some of the tools/technologies/platforms/design patterns that Pinot might replace or obviate?\n\n\nWhat are some of the useful lessons related to data modeling that users of Pinot should consider?\n\nWhat are some edge cases that they might encounter due to details of how the storage layer is architected? (e.g. data tiering, tail latencies, etc.)\n\n\nWhat are some heuristics that you have developed for understanding how to manage data lifecycles in a user-facing analytics application?\nWhat are some of the ways that users might need to customize Pinot for their specific use cases and what options do they have for extending it?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Pinot used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Pinot?\nWhen is Pinot the wrong choice?\nWhat do you have planned for the future of Pinot?\n\nContact Info\n\nKishore\n\nLinkedIn\n@KishoreBytes on Twitter\n\n\nXiang\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nApache Pinot\nStarTree\nEspresso\nApache Helix\nApache Gobblin\nApache S4\nKafka\nLucene\nStarTree Index\nPresto\nTrino\nPulsar\n\nPodcast Episode\n\n\nSpark\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimizing for low latency queries on rapidly updating datasets with highly concurrent queries. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high throughput data workloads today.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Kishore Gopalakrishna and Xiang Fu about how the Apache Pinot storage engine is designed to support low latency, high concurrency, and fast updates for powering end-user facing embedded analytics in your applications.","date_published":"2022-03-20T16:15:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/623d6bac-add1-4087-a24d-96a0034aa1b7.mp3","mime_type":"audio/mpeg","size_in_bytes":55044058,"duration_in_seconds":4376}]},{"id":"podlove-2022-03-14t00:32:40+00:00-7850dc222bef1e8","title":"Taking A Multidimensional Approach To Data Observability At Acceldata","url":"https://www.dataengineeringpodcast.com/acceldata-multidimensional-data-observability-episode-272","content_text":"Summary\nData observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nTimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. 
Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale\nYour host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data?\nCan you describe what Acceldata is and the story behind it?\nWhat does it mean for a data observability platform to be \"multidimensional\"?\nHow do the architectural characteristics of the \"modern data stack\" influence the requirements and implementation of data observability strategies?\nThe data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports?\nWho are your target users and how does that focus influence the way that you have approached feature and design priorities?\nWhat are some of the ways that you are using the Acceldata platform to run Acceldata?\nCan you describe how the Acceldata platform is implemented?\n\nHow have the design and goals of the system changed or evolved since you started working on it?\n\n\nHow are you managing the definition, collection, and correlation of events across stages of the data lifecycle?\nWhat are some of the ways that performance data can feed back into the debugging and maintenance of an organization’s data ecosystem?\nWhat are the challenges that data platform owners face when trying to interpret the metrics and events that are available in a system like Acceldata?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Acceldata used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Acceldata?\nWhen is Acceldata the wrong choice?\nWhat do you have planned for the future of Acceldata?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nAcceldata\nSemantic Web\nHortonworks\ndbt\n\nPodcast Episode\n\n\nFirebolt\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Tristan Spaulding, head of product at Acceldata, about the goals and challenges of data observability and how they are looking to differentiate themselves through a multidimensional approach to the problem (and what that means).","date_published":"2022-03-13T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/eaa1e008-f506-45d2-a944-2373ffa2947f.mp3","mime_type":"audio/mpeg","size_in_bytes":46787243,"duration_in_seconds":3797}]},{"id":"podlove-2022-03-14t00:24:16+00:00-d05503d536aa4ac","title":"Accelerating Adoption Of The Modern Data Stack At 5X Data","url":"https://www.dataengineeringpodcast.com/5x-data-modern-data-stack-episode-271","content_text":"Summary\nThe modern data stack is a constantly moving target which makes it difficult to adopt without prior experience. In order to accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. 
For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nYour host is Tobias Macey and today I’m interviewing Tarush Agarwal about how he and his team are helping organizations streamline adoption of the modern data stack\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are doing at 5x and the story behind it?\nHow has your focus and operating model shifted since we spoke a year ago?\n\nWhat are the biggest shifts in the market for data management that you have seen in that time?\n\n\nWhat are the main challenges that your customers are facing when they start working with you?\nWhat are the components that you are relying on to build repeatable data platforms for your customers?\n\nWhat are the sharp edges that you have had to smooth out to scale your implementation of those systems?\nWhat do you see as the white spaces that still exist in the offerings available for the \"modern data stack\"?\n\n\nWith the rapid introduction of so many new products in the data ecosystem, what are the categories that you see as being a long-term necessity?\n\nWhat are the areas that you predict will merge and consolidate over the next 3 – 5 years?\n\n\nWhat are the most interesting, innovative, or unexpected types of problems that you and your collaborators have had the opportunity to work on?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building the 5x organization?\nWhen is 5x the wrong choice?\nWhat do you have planned for the future of 5x?\n\nContact Info\n\nLinkedIn\n@tarush on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\n5X Data\n\nPodcast Interview\n\n\nSnowflake\n\nPodcast Interview\n\n\ndbt\n\nPodcast Interview\n\n\nFivetran\n\nPodcast Interview\n\n\nLooker\n\nPodcast Interview\n\n\nMatt Turck State of Data\nMixpanel\nAmplitude\nHeap\n\nPodcast Episode\n\n\nBigquery\nNarrator\n\nPodcast Episode\n\n\nMarquez\n\nPodcast Episode\n\n\nAtlan\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The modern data stack is a constantly moving target which makes it difficult to adopt without prior experience. In order to accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Tarush Aggarwal about his work at 5X Data to help organizations adopt the modern data stack to advance their analytical capabilities and accelerate their business.","date_published":"2022-03-13T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6b4f656d-6eec-4468-928b-752b156bc189.mp3","mime_type":"audio/mpeg","size_in_bytes":39014987,"duration_in_seconds":3231}]},{"id":"podlove-2022-03-05t22:38:58+00:00-a04a08a9fd95f6f","title":"Move Your Database To The Data And Speed Up Your Analytics With DuckDB","url":"https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270","content_text":"Summary\nWhen you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications to speed up your analytical queries that meets you where you are, whether that’s Python, R, Java, even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nYour host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what DuckDB is and the story behind it?\nWhere did the name come from?\nWhat are some of the use cases that DuckDB is designed to support?\nThe interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?\n\nHow might they be used in concert to take advantage of their relative strengths?\n\n\nWhat are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems?\nCan you describe how DuckDB is implemented?\n\nHow has the design and goals of the project changed or evolved since you began working on it?\nWhat are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?\n\n\nCan you describe a typical workflow of incorporating DuckDB into an analytical project?\nWhat are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team?\nWhat are some of the overlooked/misunderstood/under-utilized features of DuckDB that you would like to highlight?\nWhat is the governance model and plan long-term sustainability of the project?\nWhat are the most interesting, innovative, or unexpected ways that you have seen DuckDB used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckDB?\nWhen is DuckDB the wrong choice?\nWhat do you have planned for the future of DuckDB?\n\nContact Info\n\nHannes Mühleisen\n@hfmuehleisen on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDuckDB\nCWI\nSQLite\nOLAP == Online Analytical Processing\nDuck Typing\nZODB\nTeradata\nHTAP == Hybrid Transactional/Analytical Processing\nPandas\n\nPodcast.__init__ Episode\n\n\nApache Arrow\nJulia Language\nVoltron Data\nParquet\nThrift\nProtobuf\nVectorized Query Processor\nLLVM\nDuckDB Labs\nDuckDB Foundation\nMIT Open Courseware (OCW)\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

When you think about selecting a database engine for your project, you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single-user workloads. DuckDB is an in-process database engine optimized for OLAP workloads that speeds up your analytical queries and meets you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.
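
As a rough illustration of that in-process model (a minimal sketch, not taken from the episode; the table and query are invented for the example), an analytical query in DuckDB is just a library call inside your own program:

```python
import duckdb

# connect() opens an in-memory database entirely inside this process;
# pass a file path (e.g. "analytics.duckdb") instead for persistent storage.
con = duckdb.connect()

# Build a small table and run a columnar, OLAP-style aggregation over it.
con.execute(
    "CREATE TABLE events AS SELECT range AS id, range % 10 AS bucket FROM range(1000000)"
)
rows = con.execute(
    "SELECT bucket, count(*) AS n, max(id) AS max_id FROM events GROUP BY bucket ORDER BY bucket"
).fetchall()
print(rows[:3])
```

The same connection can also query Parquet files and in-scope data frames alongside your code, which is part of how it meets you in whatever environment you are already working in.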

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Hannes Mühleisen about the DuckDB engine for in-process OLAP queries that lets you use the power of SQL and the flexibility of programming languages side by side.","date_published":"2022-03-05T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6b3766ee-9132-4e90-b5db-da98daad3584.mp3","mime_type":"audio/mpeg","size_in_bytes":58804944,"duration_in_seconds":4622}]},{"id":"podlove-2022-03-05t14:38:12+00:00-6e5fcd1f1d47b84","title":"Developer Friendly Application Persistence That Is Fast And Scalable With HarperDB","url":"https://www.dataengineeringpodcast.com/harperdb-developer-friendly-database-episode-269","content_text":"Summary\nDatabases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. 
With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nAre you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!\nYour host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale across edge and cloud environments\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what HarperDB is and the story behind it?\nThere has been an explosion of database engines over the past 5 – 10 years, with each entrant offering specific capabilities. What are the use cases that HarperDB is focused on addressing?\nWhat are the issues that you experienced with existing database engines that led to the creation of HarperDB?\n\nIn what ways does HarperDB address those issues?\n\n\nWhat are some of the ways that the focus on developers has influenced the interfaces and features of HarperDB?\nWhat is your view on the role of the database in the near to medium future?\nCan you describe how HarperDB is implemented?\n\nHow have the design and goals changed from when you first started working on it?\n\n\nOne of the common difficulties in document oriented databases is being able to conduct performant joins. What are the considerations that users need to be aware of as they are designing their data models?\nWhat are some examples of deployment topologies that HarperDB can support given the pub/sub replication model?\nWhat are some of the data modeling/database design strategies that users of HarperDB should know in order to take full advantage of its capabilities?\n\nWith the dynamic schema capabilities allowing developers to add attributes and mutate the table structure at any point, what are the options for schema enforcment? (e.g. add an integer attribute and another record tries to write a string to that attribute location)\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen HarperDB used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on HarperDB?\nWhen is HarperDB the wrong choice?\nWhat do you have planned for the future of HarperDB?\n\nContact Info\n\nLinkedIn\n@sgoldberg on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nHarperDB\n\n@harperdbio on Twitter\n\n\nMulesoft\nZapier\nLMDB\nSocketIO\nSocketCluster\nMongoDB\nCouchDB\nPostgreSQL\nVoltDB\nHeroku\nSAP/Hana\nNodeJS\nDynamoDB\nCockroachDB\n\nPodcast Episode\n\n\nFastify\nHTAP == Hybrid Transactional Analytical Processing\nSplunk\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
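
As a loose sketch of the developer-friendly angle (not drawn from the episode; the URL, credentials, schema, and table here are hypothetical, and the exact request shape should be checked against the HarperDB documentation), the engine is typically driven through a single JSON-over-HTTP operations endpoint:

```python
import requests

HARPERDB_URL = "http://localhost:9925"  # hypothetical local instance
AUTH = ("admin", "password")            # hypothetical credentials

# One endpoint; the "operation" field selects what to do (SQL in this case).
# The schema and table names are made up for the example.
payload = {
    "operation": "sql",
    "sql": "SELECT id, name FROM dev.dog ORDER BY id",
}

response = requests.post(HARPERDB_URL, json=payload, auth=AUTH)
response.raise_for_status()
print(response.json())
```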

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Stephen Goldberg, CEO of HarperDB, about how he and his team are building a fast, scalable, and developer friendly database engine that supports edge, cloud, and datacenter environments.","date_published":"2022-03-05T13:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4efb5db6-eda4-4eae-b164-750373b6af00.mp3","mime_type":"audio/mpeg","size_in_bytes":37878025,"duration_in_seconds":2973}]},{"id":"podlove-2022-02-28t02:24:20+00:00-0bb825ebe127063","title":"Manage Your Unstructured Data Assets Across Cloud And Hybrid Environments With Komprise","url":"https://www.dataengineeringpodcast.com/komprise-unstructured-data-management-episode-267","content_text":"Summary\nThere are a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge Krishna Subramanian and her co-founders at Komprise built a system that allows you to treat use and secure your data wherever it lives, and track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nSo now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. 
For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.\nYour host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Komprise is and the story behind it?\nWho are the target customers of the Komprise platform?\n\nWhat are the core use cases that you are focused on supporting?\n\n\nHow would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?\n\nWhat are some of the shortcomings of the enterprise storage providers’ methods for managing storage tiers when trying to use that data for analytical workloads?\n\n\nGiven the growth in popularity and capabilities of cloud solutions, how have you approached the strategic positioning of your product to capitalize on the market?\nCan you describe how the Komprise platform is architected?\n\nWhat are some of the most complex considerations that you have had to engineer for when dealing with enterprise data distribution in hybrid cloud environments?\n\n\nWhat are the data replication and consistency guarantees that you are able to offer while spanning across on-premise and cloud systems/block and object storage? (e.g. eventual consistency vs. read-after-write, low latency replication on data changes vs. scheduled syncing, etc.)\nHow do you determine and validate the heuristics that you use for understanding how/when to distribute files across storage systems?\nHow does the specific workload that you are powering influence the specific operations/capabilities that your customers take advantage of?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Komprise used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Komprise?\nWhen is Komprise the wrong choice?\nWhat do you have planned for the future of Komprise?\n\nContact Info\n\nLinkedIn\n@cloudKrishna on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nKomprise\nUnstruk\n\nPodcast Episode\n\n\nSMB\nNFS\nS3\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

There is a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge, Krishna Subramanian and her co-founders at Komprise built a system that allows you to treat, use, and secure your data wherever it lives, and to track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Krishna Subramanian about how Komprise is addressing the challenge of managing unstructured data assets across operating environments without losing your sanity.","date_published":"2022-02-27T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ccad2708-52b8-45ac-903d-ba3f277dab78.mp3","mime_type":"audio/mpeg","size_in_bytes":42254743,"duration_in_seconds":3286}]},{"id":"podlove-2022-02-28t02:34:53+00:00-13bad7c8ab4f96c","title":"Reflections On Designing A Data Platform From Scratch","url":"https://www.dataengineeringpodcast.com/data-platform-design-episode-268","content_text":"Summary\nBuilding a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nTimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale\nRudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. 
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.\nI’m your host, Tobias Macey, and today I’m sharing the approach that I’m taking while designing a data platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the components that need to be considered when designing a solution?\n\nData integration (extract and load)\n\nWhat are your data sources?\nBatch or streaming (acceptable latencies)\n\n\nData storage (lake or warehouse)\n\nHow is the data going to be used?\nWhat other tools/systems will need to integrate with it?\nThe warehouse (Bigquery, Snowflake, Redshift) has become the focal point of the \"modern data stack\"\n\n\nData orchestration\n\nWho will be managing the workflow logic?\n\n\nMetadata repository\n\nTypes of metadata (catalog, lineage, access, queries, etc.)\n\n\nSemantic layer/reporting\nData applications\n\n\nImplementation phases\n\nBuild a single end-to-end workflow of a data application using a single category of data across sources\nValidate the ability for an analyst/data scientist to self-serve a notebook powered analysis\nIterate\n\n\nRisks/unknowns\n\nData modeling requirements\nSpecific implementation details as integrations across components are built\nWhen to use a vendor and risk lock-in vs. spend engineering time\n\n\n\nContact Info\n\nEmail\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPresto\n\nPodcast Episode\n\n\nTrino\n\nPodcast Episode\n\n\nDagster\n\nPodcast Episode\n\n\nPrefect\n\nPodcast Episode\n\n\nDremio\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A monologue by Tobias Macey, the host of the show, about the design considerations involved in building a data platform and how the lessons learned from running the Data Engineering Podcast are influencing the choices made.","date_published":"2022-02-27T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cd5e7922-139d-4c51-ac58-6d781b943d0a.mp3","mime_type":"audio/mpeg","size_in_bytes":30743579,"duration_in_seconds":2421}]},{"id":"podlove-2022-02-21t03:42:12+00:00-df4e5d8606c9ae0","title":"Build Your Python Data Processing Your Way And Run It Anywhere With Fugue","url":"https://www.dataengineeringpodcast.com/fugue-python-data-processing-episode-266","content_text":"Summary\nPython has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. 
Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nEvery data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even javascript heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.\nYour host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Fugue is and the story behind it?\nWhat are the core goals of the Fugue project?\nWho are the target users for Fugue and how does that influence the feature priorities and API design?\nHow does Fugue compare to projects such as Modin, etc. for abstracting over the execution engine?\nWhat are some of the sharp edges that contribute to the engineering effort required to migrate from a single machine to Spark or Dask?\nWhat are some of the determining factors that will influence the decision of whether to use Pandas, Spark, or Dask?\nCan you describe how Fugue is implemented?\n\nHow have the design and goals of the project changed or evolved since you started working on it?\n\n\nHow do you ensure the consistency of logic across execution engines?\nCan you describe the workflow of integrating Fugue into an existing or greenfield project?\nHow have you approached the work of automating logic optimization across execution contexts?\n\nWhat are some of the risks or error conditions that you have to guard against?\nHow do you manage validation of those optimizations, particularly as the different engines release new versions or capabilities?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Fugue used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Fugue?\nWhen is Fugue the wrong choice?\nWhat do you have planned for the future of Fugue?\n\nContact Info\n\nLinkedIn\nEmail\nFugue Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nFugue\nFugue Tutorials\nPrefect\n\nPodcast Episode\n\n\nBodo\n\nPodcast Episode\n\n\nPandas\nDuckDB\nKoalas\nDask\n\nPodcast Episode\n\n\nSpark\nModin\n\nPodcast.__init__ Episode\n\n\nFugue SQL\nFlink\nPyCaret\nANTLR\nOmniSci\nIbis\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
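
To make that write-once, run-anywhere idea concrete, here is a minimal sketch built around Fugue's transform() entry point (the column names and tax logic are invented for the example, and the commented-out engine dispatch assumes Dask is installed):

```python
import pandas as pd
from fugue import transform

def add_price_with_tax(df: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas logic with no engine-specific code in it.
    df["price_with_tax"] = df["price"] * 1.08
    return df

df = pd.DataFrame({"item": ["a", "b", "c"], "price": [10.0, 20.0, 30.0]})

# With no engine given, this runs locally on pandas.
local_result = transform(df, add_price_with_tax, schema="*,price_with_tax:double")
print(local_result)

# The same call can be dispatched to a distributed engine just by naming it,
# e.g. engine="dask" or engine="spark", without changing the function above.
# dask_result = transform(df, add_price_with_tax, schema="*,price_with_tax:double", engine="dask")
```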

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Kevin Kho about the open source Fugue framework for abstracting away the execution engine for your Python data workflows so you can write it once and run it anywhere.","date_published":"2022-02-20T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2a6d67aa-2136-4cf0-92d7-4400d90cabf3.mp3","mime_type":"audio/mpeg","size_in_bytes":36943353,"duration_in_seconds":3667}]},{"id":"podlove-2022-02-21t03:16:01+00:00-f93b7bc65a493da","title":"Understanding The Immune System With Data At ImmunAI","url":"https://www.dataengineeringpodcast.com/immunai-life-sciences-episode-265","content_text":"Summary\nThe life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. 
Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nYour host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system.\n\nInterview\n\nIntroduction (see Guy’s bio below)\nHow did you get involved in the area of data management?\nCan you describe what Immunai is and the story behind it?\nWhat are some of the categories of information that you are working with?\n\nWhat kinds of insights are you trying to power/questions that you are trying to answer with that data?\n\n\nWho are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?\nWhat are some of the challenges unique to the biological data domain that you have had to address?\n\nWhat are some of the limitations in the off-the-shelf tools when applied to biological data?\nHow have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?\n\n\nCan you describe the platform architecture that you are using to support your stakeholders?\n\nWhat are some of the constraints or requirements (e.g. regulatory, security, etc.) that you need to account for in the design?\n\n\nWhat are some of the ways that you make your data accessible to AI/ML engineers?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Immunai used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working at Immunai?\nWhat do you have planned for the future of the Immunai data platform?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nImmunAI\nApache Arrow\nColumbia Genome Center\nDagster\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Guy Yachdav about the work that he and his team are doing at ImmunAI to help researchers and scientists understand the immune system through data and machine learning.","date_published":"2022-02-20T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d96c5072-f77a-475c-87a0-946fb14fb097.mp3","mime_type":"audio/mpeg","size_in_bytes":32555240,"duration_in_seconds":2587}]},{"id":"podlove-2022-02-14t00:57:53+00:00-304bf89941234d1","title":"Bring Your Code To Your Streaming And Static Data Without Effort With The Deephaven Real Time Query Engine","url":"https://www.dataengineeringpodcast.com/deephaven-real-time-query-engine-episode-264","content_text":"Summary\nStreaming data sources are becoming more widely available as tools to handle their storage and distribution mature. However it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven shares his journey with the technology that powers the platform, how he and his team are pouring their energy into the community edition of the technology so that you can use it freely in your own work.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. 
Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.\nYour host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Deephaven is and the story behind it?\nWhat is the role of Deephaven in the context of an organization’s data platform?\n\nWhat are the upstream and downstream systems and teams that it is likely to be integrated with?\n\n\nWho are the target users of Deephaven and how does that influence the feature priorities and design of the platform?\ncomparison of use cases/experience with Materialize\nWhat are the different components that comprise the suite of functionality in Deephaven?\nHow have you architected the system?\n\nWhat are some of the ways that the goals/design of the platform have changed or evolved since you started working on it?\n\n\nWhat are some of the impedance mismatches that you have had to address between supporting different language environments and data access patterns? (e.g. batch/streaming/ML and Python/Java/R)\nCan you describe some common workflows that a data engineer might build with Deephaven?\n\nWhat are the avenues for collaboration across data roles and stakeholders?\n\n\nlicensing choice/governance model\nWhat are the most interesting, innovative, or unexpected ways that you have seen Deephaven used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Deephaven?\nWhen is Deephaven the wrong choice?\nWhat do you have planned for the future of Deephaven?\n\nContact Info\n\n@pete_paco on Twitter\n@deephaven on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nDeephaven\n\nGitHub\n\n\nMaterialize\n\nPodcast Episode\n\n\nArrow Flight\nkSQLDB\n\nPodcast Episode\n\n\nRedpanda\n\nPodcast Episode\n\n\nPandas\n\nPodcast Episode\n\n\nNumPy\nNumba\nBarrage\nDebezium\n\nPodcast Episode\n\n\nJPy\nSabermetrics\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Streaming data sources are becoming more widely available as tools to handle their storage and distribution mature. However, it is still a challenge to analyze this data as it arrives while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition so that you can use it freely in your own work.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Pete Goddard about the impressive engineering that he and his team have put into the Deephaven real time query engine for effortlessly working across streaming and static data in your preferred language.","date_published":"2022-02-13T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cf8a8e05-433c-41c1-ab00-7ae0c8c594bf.mp3","mime_type":"audio/mpeg","size_in_bytes":51576318,"duration_in_seconds":3725}]},{"id":"podlove-2022-02-14t00:32:26+00:00-1f00d6ca50f6b0e","title":"Build Your Own End To End Customer Data Platform With Rudderstack","url":"https://www.dataengineeringpodcast.com/rudderstack-open-source-customer-data-platform-episode-263","content_text":"Summary\nCollecting, integrating, and activating data are all challenging activities. When that data pertains to your customers it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control the team at Rudderstack created their eponymous open source platform that allows you to work with first and third party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. 
With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Soumyadeb Mitra about his experience as the founder of Rudderstack and its role in your data platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Rudderstack is and the story behind it?\nWhat are the main use cases that Rudderstack is designed to support?\nWho are the target users of Rudderstack?\n\nHow does the availability of the managed cloud service change the user profiles that you can target?\nHow do these user profiles influence your focus and prioritization of features and user experience?\n\n\nHow would you characterize the position of Rudderstack in the current data ecosystem?\n\nWhat other tools/systems might you replace with Rudderstack?\n\n\nHow do you think about the application of Rudderstack compared to tools for data integration (e.g. Singer, Stitch, Fivetran) and reverse ETL (e.g. Grouparoo, Hightouch, Census)?\nCan you describe how the Rudderstack platform is designed and implemented?\n\nHow have the goals/design/use cases of Rudderstack changed or evolved since you first started working on it?\nWhat are the different extension points available for engineers to extend and customize Rudderstack?\n\n\nWorking with customer data is a core capability in Rudderstack. How do you manage the identity resolution of users as they transition back and forth between anonymous and identified?\n\nWhat are some of the data privacy primitives that you include to assist with data security/regulatory concerns?\n\n\nWhat is the process of getting started with Rudderstack as a software or data platform engineer?\nWhat are some of the operational challenges related to running your own deployment of Rudderstack?\nWhat are some of the overlooked/underemphasized capabilities of Rudderstack?\nHow have you approached the governance model/boundaries between OSS and commercial for Rudderstack?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Rudderstack used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Rudderstack?\nWhen is Rudderstack the wrong choice?\nWhat do you have planned for the future of Rudderstack?\n\nContact Info\n\nLinkedIn\n@soumyadeb_mitra on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nRudderstack\nHadoop\nSpark\nSegment\n\nPodcast Episode\n\n\nGrouparoo\n\nPodcast Episode\n\n\nFivetran\n\nPodcast Episode\n\n\nStitch\nSinger\n\nPodcast Episode\n\n\nCensus\n\nPodcast Episode\n\n\nHightouch\n\nPodcast Episode\n\n\nLiveRamp\nAirbyte\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Collecting, integrating, and activating data are all challenging activities. When that data pertains to your customers, it can become even more complex. To simplify the work of managing the full flow of your customer data, and to keep you in full control, the team at Rudderstack created their eponymous open source platform that allows you to work with first- and third-party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap with it, how to set it up for your own data needs, and how it is architected to scale to meet demand.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Soumyadeb Mitra about the unique requirements for information processing in a customer data platform and how the open source Rudderstack platform allows you to customize it to meet your needs.","date_published":"2022-02-13T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7620add8-fc60-4988-a559-eab5ff9671a8.mp3","mime_type":"audio/mpeg","size_in_bytes":38060618,"duration_in_seconds":2854}]},{"id":"podlove-2022-02-06t22:21:20+00:00-aa98b8b11c4d88a","title":"Scale Your Spatial Analysis By Building It In SQL With Syntax Extensions","url":"https://www.dataengineeringpodcast.com/spatial-analysis-with-sql-episode-262","content_text":"Summary\nAlong with globalization of our societies comes the need to analyze the geospatial and geotemporal data that is needed to manage the growth in commerce, communications, and other activities. In order to make geospatial analytics more maintainable and scalable there has been an increase in the number of database engines that provide extensions to their SQL syntax that supports manipulation of spatial data. In this episode Matthew Forrest shares his experiences of working in the domain of geospatial analytics and the application of SQL dialects to his analysis.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. 
The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.\nYour host is Tobias Macey and today I’m interviewing Matthew Forrest about doing spatial analysis in SQL\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what spatial SQL is and some of the use cases that it is relevant for?\ncompatibility with/comparison to syntax from PostGIS\nWhat is involved in implementation of spatial logic in database engines\nmapping geospatial concepts into declarative syntax\nfoundational data types\ndata modeling\nworkflow for analyzing spatial data sets outside of database engines\ntranslating from e.g. geopandas to SQL\nlevel of support in database engines for spatial data types\nWhat are the most interesting, innovative, or unexpected ways that you have seen spatial SQL used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with spatial SQL?\nWhen is SQL the wrong choice for spatial analysis?\nWhat do you have planned for the future of spatial analytics support in SQL for the Carto platform?\n\nContact Info\n\nLinkedIn\nWebsite\n@mbforr on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nCarto\nSpatial SQL Blog Post\nSpatial Analysis\nPostGIS\nQGIS\nKML\nShapefile\nGeoJSON\nPaul Ramsey’s Blog\nNorwegian SOSI\nGDAL\nGoogle Cloud Dataflow\nGeoBEAM\nCarto Data Observatory\nWGS84 Projection\nEPSG Code\nPySAL\nGeoMesa\nUber H3 Spatial Indexing\nPGRouting\nSpatialite\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Along with the globalization of our societies comes the need to analyze the geospatial and geotemporal data required to manage the growth in commerce, communications, and other activities. To make geospatial analytics more maintainable and scalable, a growing number of database engines provide extensions to their SQL syntax that support manipulation of spatial data. In this episode Matthew Forrest shares his experience of working in the domain of geospatial analytics and applying SQL dialects to his analysis.
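
As a purely illustrative sketch of what such a SQL extension looks like in practice (this is not code from the episode, and the table and column names are invented), a PostGIS-style query embedded in a Python script might aggregate points by the polygons that contain them:

```python
# Hypothetical spatial SQL: count the stores that fall inside each neighborhood
# polygon using the PostGIS ST_Contains predicate. Table/column names are invented.
SPATIAL_ROLLUP_QUERY = """
SELECT
    n.name,
    COUNT(s.id) AS store_count
FROM neighborhoods AS n
LEFT JOIN stores AS s
    ON ST_Contains(n.geom, s.geom)
GROUP BY n.name
ORDER BY store_count DESC;
"""

if __name__ == "__main__":
    # In practice this would be executed against a PostGIS-enabled database
    # (e.g. via psycopg2 or SQLAlchemy); here we only print the query.
    print(SPATIAL_ROLLUP_QUERY)
```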

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Matthew Forrest about using SQL to build your spatial analysis workflows so that they are more maintainable and uniform","date_published":"2022-02-06T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1ca98ca8-a045-44b8-a5b1-e7042210afec.mp3","mime_type":"audio/mpeg","size_in_bytes":42602561,"duration_in_seconds":3594}]},{"id":"podlove-2022-02-06t21:46:38+00:00-55c91f464e5a9d8","title":"Scalable Strategies For Protecting Data Privacy In Your Shared Data Sets","url":"https://www.dataengineeringpodcast.com/privacy-dynamics-data-privacy-strategies-episode-261","content_text":"Summary\nThere are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses then it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise be at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. 
With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Will Thompson about managing data privacy concerns for data sets used in analytics and machine learning\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nData privacy is a multi-faceted problem domain. Can you start by enumerating the different categories of privacy concern that are involved in analytical use cases?\nCan you describe what Privacy Dynamics is and the story behind it?\n\nWhich categor(y|ies) are you focused on addressing?\n\n\nWhat are some of the best practices in the definition, protection, and enforcement of data privacy policies?\n\nIs there a data security/privacy equivalent to the OWASP top 10?\n\n\nWhat are some of the techniques that are available for anonymizing data while maintaining statistical utility/significance?\n\nWhat are some of the engineering/systems capabilities that are required for data (platform) engineers to incorporate these practices in their platforms?\n\n\nWhat are the tradeoffs of encryption vs. obfuscation when anonymizing data?\nWhat are some of the types of PII that are non-obvious?\nWhat are the risks associated with data re-identification, and what are some of the vectors that might be exploited to achieve that?\n\nHow can privacy risks mitigation be maintained as new data sources are introduced that might contribute to these re-identification vectors?\n\n\nCan you describe how Privacy Dynamics is implemented?\n\nWhat are the most challenging engineering problems that you are dealing with?\n\n\nHow do you approach validation of a data set’s privacy?\nWhat have you found to be useful heuristics for identifying private data?\n\nWhat are the risks of false positives vs. false negatives?\n\n\nCan you describe what is involved in integrating the Privacy Dynamics system into an existing data platform/warehouse?\n\nWhat would be required to integrate with systems such as Presto, Clickhouse, Druid, etc.?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Privacy Dynamics used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacy Dynamics?\nWhen is Privacy Dynamics the wrong choice?\nWhat do you have planned for the future of Privacy Dynamics?\n\nContact Info\n\nLinkedIn\n@willseth on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nPrivacy Dynamics\nPandas\n\nPodcast Episode – Pandas For Data Engineering\n\n\nHomomorphic Encryption\nDifferential Privacy\nImmuta\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses, it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise put at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.
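
A minimal sketch of two of the obfuscation moves alluded to here, pseudonymizing a direct identifier and coarsening a quasi-identifier, is shown below. It is illustrative only and is not how Privacy Dynamics itself is implemented; the column names are invented.

```python
import hashlib

import pandas as pd

# Toy data set with a direct identifier (email) and a quasi-identifier (age).
users = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "age": [27, 63],
    "purchases": [3, 11],
})

# Pseudonymize the direct identifier with a one-way hash (add a secret salt in practice).
users["user_key"] = users["email"].map(
    lambda e: hashlib.sha256(e.encode("utf-8")).hexdigest()[:16]
)
users = users.drop(columns=["email"])

# Generalize the quasi-identifier into broad bands instead of exact values.
users["age_band"] = pd.cut(
    users["age"], bins=[0, 30, 50, 70, 120], labels=["<30", "30-49", "50-69", "70+"]
)
users = users.drop(columns=["age"])

print(users)
```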

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Privacy Dynamics lead engineer Will Thompson about useful strategies for managing data privacy in your shared data sets.","date_published":"2022-02-06T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/37cf0ad6-29f9-44a8-a28d-3a90d6581722.mp3","mime_type":"audio/mpeg","size_in_bytes":46425050,"duration_in_seconds":3606}]},{"id":"podlove-2022-01-31t02:14:55+00:00-65624b9817f40f4","title":"A Reflection On Learning A Lot More Than 97 Things Every Data Engineer Should Know","url":"https://www.dataengineeringpodcast.com/a-few-things-every-data-engineer-should-know-episode-260","content_text":"Summary\nThe Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition to that, the host curated the essays contained in the book \"97 Things Every Data Engineer Should Know\", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their career of data engineering and looking to advance in their roles.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. 
The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.\nYour host is Tobias Macey and today I’m doing something a bit different. I’m going to talk about some of the lessons that I have learned while running the podcast, compiling the book \"97 Things Every Data Engineer Should Know\", and some of the themes that I’ve observed throughout.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nOverview of the 97 things book\n\nHow the project came about\nGoals of the book\n\n\nWhat are the paths into data engineering?\nWhat are some of the macroscopic themes in the industry?\nWhat are some of the microscopic details that are useful/necessary to succeed as a data engineer?\nWhat are some of the career/team/organizational details that are helpful for data engineers?\nWhat are the most interesting, innovative, or unexpected outcomes/feedback that I have seen from running the podcast and working on the book?\nWhat are the most interesting, unexpected, or challenging lessons that I have learned while working on the Data Engineering Podcast and 97 things book?\nWhat do I have planned for the future of the podcast?\n\nContact Info\n\nLinkedIn\nEmail\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\n97 Things Every Data Engineer Should Know\n\nBuy on Amazon (affiliate link)\nRead on O’Reilly Learning\nO’Reilly Learning 30 Day Free Trial\n\n\nPodcast.__init__\nPipeline Academy data engineering bootcamp\n\nPodcast Episode\n\n\nHadoop\nObject Relational Mapper (ORM)\nSinger\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nData Mesh\n\nPodcast Episode\n\n\nData Contracts Episode\nDesigning Data Intensive Applications\nData Council\n\n2022 Conference\n\n\nData Engineering Weekly Newsletter\nData Mesh Learning\nMLOps Community\nAnalytics Engineering Newsletter\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition, the host curated the essays contained in the book "97 Things Every Data Engineer Should Know", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their data engineering careers and looking to advance in their roles.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An exploration of the macroscopic and microscopic themes and details that are useful for new and experienced data engineers to know in order to grow their careers.","date_published":"2022-01-30T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1736eacf-0b9b-4b19-a8ce-085d355f2d82.mp3","mime_type":"audio/mpeg","size_in_bytes":38087003,"duration_in_seconds":2495}]},{"id":"podlove-2022-01-31t01:21:01+00:00-33d1ebf5ab05ef8","title":"Effective Pandas Patterns For Data Engineering","url":"https://www.dataengineeringpodcast.com/pandas-patterns-for-data-engineering-episode-259","content_text":"Summary\nPandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. 
Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Matt Harrison about useful tips for using Pandas for data engineering projects\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are the main tasks that you have seen Pandas used for in a data engineering context?\nWhat are some of the common mistakes that can lead to poor performance when scaling to large data sets?\nWhat are some of the utility features that you have found most helpful for data processing?\nOne of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas?\nPandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows?\nPandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maintainable?\n\nHow can data engineers support testing their transformations?\n\n\nThere are a number of projects that aim to scale Pandas logic across cores and clusters. What are some of the considerations for when to use one of these tools, and how to select the proper framework? (e.g. Dask, Modin, Ray, etc.)\nWhat are some anti-patterns that engineers should guard against when using Pandas for data processing?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Pandas used for data processing?\nWhen is Pandas the wrong choice for data processing?\nWhat are some of the projects related to Pandas that you are keeping an eye on?\n\nContact Info\n\n@__mharrison__ on Twitter\nmetasnake\nEffective Pandas Bundle (affiliate link with 20% discount code applied)\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMetasnake\nSnowflake Schema\nOLAP\nPanel Data\nNumPy\nDask\n\nPodcast Episode\n\n\nParquet\nArrow\nFeather\nZen of Python\nJoel Grus’ I Don’t Like Notebooks presentation\nPandas Method Chaining\nEffective Pandas Book (affiliate link with 20% discount code applied)\n\nPodcast.__init__ Episode\n\n\npytest\n\nPodcast.__init__ Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\nHypothesis\n\nPodcast.__init__ Episode\n\n\nPapermill\n\nPodcast Episode\n\n\nJupytext\nKoalas\nModin\n\nPodcast.__init__ Episode\n\n\nSpark\nRay\n\nPodcast.__init__ Episode\n\n\nSpark Pandas API\nVaex\nRapids\nTerality\nH2O\nH2O DataTable\nFugue\nIbis\nMulti-process Pandas\nPandaPy\nPolars\nGoogle Colab\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Pandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result, it has become a standard tool for data engineers across a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes while remaining understandable and maintainable.
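
The method-chaining style of Pandas pipeline referenced in the episode links can be sketched roughly as follows; this is a hypothetical example with invented column names, not code from the book or the episode.

```python
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    # Each step is a small, pure transformation, which keeps the pipeline
    # readable and easy to test in isolation.
    return (
        raw
        .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
        .dropna(subset=["customer_id"])
        .astype({"customer_id": "int64"})
        .query("amount > 0")
        .assign(amount_usd=lambda df: df["amount"].round(2))
    )


raw = pd.DataFrame({
    "order_date": ["2022-01-05", "2022-01-06", "2022-01-07"],
    "customer_id": [1, None, 3],
    "amount": [19.991, -5.0, 42.5],
})
print(clean_orders(raw))
```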

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Matt Harrison about how to write effective pandas code for scalable and maintainable data processing logic that can be understood by other members of your team.","date_published":"2022-01-30T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1baee24d-c79f-4cb4-b948-e26c284fe0d1.mp3","mime_type":"audio/mpeg","size_in_bytes":49758294,"duration_in_seconds":3621}]},{"id":"podlove-2022-01-23t12:44:18+00:00-115b3cb83ff5d19","title":"The Importance Of Data Contracts As The Interface For Data Integration With Abhi Sivasailam","url":"https://www.dataengineeringpodcast.com/data-contracts-for-data-mesh-episode-258","content_text":"Summary\nData platforms are exemplified by a complex set of connections that are subject to a set of constantly evolving requirements. In order to make this a tractable problem it is necessary to define boundaries for communication between concerns, which brings with it the need to establish interface contracts for communicating across those boundaries. The recent move toward the data mesh as a formalized architecture that builds on this design provides the language that data teams need to make this a more organized effort. In this episode Abhi Sivasailam shares his experience designing and implementing a data mesh solution with his team at Flexport, and the importance of defining and enforcing data contracts that are implemented at those domain boundaries.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. 
Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.\nYour host is Tobias Macey and today I’m interviewing Abhi Sivasailam about the different social and technical interfaces available for defining and enforcing data contracts\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what your working definition of a \"data contract\" is?\n\nWhat are the goals and purpose of these contracts?\n\n\nWhat are the locations and methods of defining a data contract?\n\nWhat kind of information needs to be encoded in a contract definition?\n\n\nHow do you manage enforcement of contracts?\nmanifestations of contracts in data mesh implementation\nergonomics (technical and social) of data contracts and how to prevent them from prohibiting productivity\nWhat are the most interesting, innovative, or unexpected approaches to data contracts that you have seen?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data contract implementation?\nWhen are data contracts the wrong choice?\n\nContact Info\n\nLinkedIn\n@_abhisivasailam on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nFlexport\nDebezium\n\nPodcast Episode\n\n\nData Mesh At Flexport Presentation\nData Mesh\n\nPodcast Episode\n\n\nColumn Names As Contracts podcast episode with Emily Riederer\ndbtplyr\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data platforms are exemplified by a complex set of connections that are subject to constantly evolving requirements. In order to make this a tractable problem, it is necessary to define boundaries for communication between concerns, which brings with it the need to establish interface contracts for communicating across those boundaries. The recent move toward the data mesh as a formalized architecture that builds on this design provides the language that data teams need to make this a more organized effort. In this episode Abhi Sivasailam shares his experience designing and implementing a data mesh solution with his team at Flexport, and the importance of defining and enforcing data contracts at those domain boundaries.
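
As a rough, hypothetical sketch of what a contract at such a boundary can look like (not the specific implementation discussed in the episode), a producing team might publish a declarative schema that consuming teams validate records against:

```python
# Illustrative data contract: the field names and checking logic are invented.
ORDERS_CONTRACT = {
    "name": "orders_v1",
    "owner": "checkout-team",
    "fields": {
        "order_id": {"type": str, "required": True},
        "customer_id": {"type": int, "required": True},
        "amount_usd": {"type": float, "required": True},
        "coupon_code": {"type": str, "required": False},
    },
}


def violations(record: dict, contract: dict) -> list:
    """Return a list of contract violations for a single record."""
    problems = []
    for field, spec in contract["fields"].items():
        if field not in record or record[field] is None:
            if spec["required"]:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], spec["type"]):
            problems.append(f"wrong type for {field}: expected {spec['type'].__name__}")
    return problems


if __name__ == "__main__":
    bad_record = {"order_id": "A-123", "customer_id": "42", "amount_usd": 19.99}
    print(violations(bad_record, ORDERS_CONTRACT))
    # -> ['wrong type for customer_id: expected int']
```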

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Abhi Sivasailam about his work at Flexport to design and implement a data mesh solution that relies heavily on data contracts to provide a stable interface that teams can implement for integrating analytical workflows across the organization.","date_published":"2022-01-23T14:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/59c6c262-05d0-45ba-93f4-e72ec8059249.mp3","mime_type":"audio/mpeg","size_in_bytes":40074659,"duration_in_seconds":3360}]},{"id":"podlove-2022-01-23t12:14:20+00:00-56a7249789cbc77","title":"Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig","url":"https://www.dataengineeringpodcast.com/ashish-mrig-data-platforms-data-teams-episode-257","content_text":"Summary\nData engineering is a relatively young and rapidly expanding field, with practitioners having a wide array of experiences as they navigate their careers. Ashish Mrig currently leads the data analytics platform for Wayfair, as well as running a local data engineering meetup. In this episode he shares his career journey, the challenges related to management of data professionals, and the platform design that he and his team have built to power analytics at a large company. He also provides some excellent insights into the factors that play into the build vs. buy decision at different organizational sizes.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. 
With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Ashish Mrig about his path as a data engineer\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nYou currently lead a data engineering team at a relatively large company. What are the topics that account for the majority of your time and energy?\nWhat are some of the most valuable lessons that you’ve learned about managing and motivating teams of data professionals?\nWhat has been your most consistent challenge across the different generations of the data ecosystem?\nHow is your current data platform architected?\nGiven the current state of the technology and services landscape, how would you approach the design and implementation of a greenfield rebuild of your platform?\nWhat are some of the pitfalls that you have seen data teams encounter most frequently?\nYou are running a data engineering meetup for your local community in the Boston area. What have been some of the recurring themes that are discussed in those events?\n\nContact Info\n\nMedium Blog\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nWayfair\nTivo\nInfluxDB\n\nPodcast Interview\n\n\nBigQuery\nAtScale\n\nPodcast Episode\n\n\nData Engineering Boston\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data engineering is a relatively young and rapidly expanding field, with practitioners having a wide array of experiences as they navigate their careers. Ashish Mrig currently leads the data analytics platform at Wayfair, as well as running a local data engineering meetup. In this episode he shares his career journey, the challenges of managing data professionals, and the platform design that he and his team have built to power analytics at a large company. He also provides some excellent insights into the factors that play into the build vs. buy decision at different organizational sizes.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Ashish Mrig about his career in data engineering, his experiences managing data teams at Wayfair, and the technical considerations that factor into platform design decisions in large organizations.","date_published":"2022-01-23T14:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4fa67f0c-0d3b-4c26-b43c-163d600e32a0.mp3","mime_type":"audio/mpeg","size_in_bytes":38540267,"duration_in_seconds":3164}]},{"id":"podlove-2022-01-15t19:35:21+00:00-9d6e4fa338c157f","title":"Automated Data Quality Management Through Machine Learning With Anomalo","url":"https://www.dataengineeringpodcast.com/anomalo-data-quality-platform-episode-256","content_text":"Summary\nData quality control is a requirement for being able to trust the various reports and machine learning models that are relying on the information that you curate. Rules based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo are building a machine learning powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nThe only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye let’s data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. 
Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.\nYour host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Anomalo is and the story behind it?\nManaging data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?\n\nWhat are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?\n\n\ntypes of data quality issues identified\n\nutility of automated vs programmatic tests\n\n\nCan you describe how the Anomalo system is designed and implemented?\n\nHow have the design and goals of the platform changed or evolved since you started working on it?\n\n\nWhat is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?\nmodel training/customization process\nstatistical model\nseasonality/windowing\nCI/CD\nWith any monitoring system the most challenging thing to do is avoid generating alerts that aren’t actionable or helpful. What is your strategy for helping your customers avoid alert fatigue?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Anomalo used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomalo?\nWhen is Anomalo the wrong choice?\nWhat do you have planned for the future of Anomalo?\n\nContact Info\n\nElliot\n\nLinkedIn\n@eshmu on Twitter\n\n\nJeremy\n\nLinkedIn\n@jeremystan on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nAnomalo\nGreat Expectations\n\nPodcast Episode\n\n\nShapley Values\nGradient Boosted Decision Tree\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data quality control is a requirement for being able to trust the various reports and machine learning models that are relying on the information that you curate. Rules based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo are building a machine learning powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the founders of Anomalo about how they are using statistical machine learning systems to automate the detection and diagnosis of data quality issues that occur in your data warehouse.","date_published":"2022-01-15T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bc22f7b9-adf7-4853-899a-61e760a25544.mp3","mime_type":"audio/mpeg","size_in_bytes":48430142,"duration_in_seconds":3750}]},{"id":"podlove-2022-01-15t19:30:53+00:00-44ba3f9ce1ff628","title":"An Introduction To Data And Analytics Engineering For Non-Programmers","url":"https://www.dataengineeringpodcast.com/building-data-products-book-episode-255","content_text":"Summary\nApplications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now it is being used to power consumer facing services, influence organizational behaviors, and build sophisticated machine learning systems. Given this increased level of importance it has become necessary for everyone in the business to treat data as a product in the same way that software applications have driven the early 2000s. In this episode Brian McMillan shares his work on the book \"Building Data Products\" and how he is working to educate business users and data professionals about the combination of technical, economical, and business considerations that need to be blended for these projects to succeed.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nStreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. 
Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.\nYour host is Tobias Macey and today I’m interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what motivated you to write a book about the work of building data products?\n\nWho is your target audience?\nWhat are the main goals that you are trying to achieve through the book?\n\n\nWhat was your approach for determining the structure and contents of the book?\nWhat are the core principles of data engineering that have remained from the original wave of ETL tools and rigid data warehouses?\n\nWhat are some of the new foundational elements of data products that need to be codified for the next generation of organizations and data professionals?\n\n\nThere is a lot of activity and conversation happening in and around data which can make it difficult to understand which parts are signal and which are noise. What, if anything, do you see as being truly new and/or innovative?\n\nAre there any core lessons or principles that you consider to be at risk of getting drowned out in the current frenzy of activity?\n\n\nHow do the practices for building products with small teams differ from those employed by larger groups?\n\nWhat do you see as the threshold beyond which a team can no longer be considered \"small\"?\n\n\nWhat are the roles/skills/titles that you view as necessary for building data products in the current phase of maturity for the ecosystem?\nWhat do you see as the biggest risks to engineering and data teams?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the principles in the book used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?\n\nContact Info\n\nEmail\ntwitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nBuilding Data Products: Introduction to Data and Analytics Engineering for non-programmers\nTheory of Constraints\nThroughput Economics\n\"Swaptronics\" – The act of swapping out electronic components until you find a combination that works.\nInformatica\nSSIS – Microsoft SQL Server Integration Services\n3X – Kent Beck\nWardley Maps\nVega Lite\nDatasette\nWhy Use Make – Mike Bostock\nBuilding Production Applications Using Go & SQLite\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Applications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now it is being used to power consumer facing services, influence organizational behaviors, and build sophisticated machine learning systems. Given this increased level of importance it has become necessary for everyone in the business to treat data as a product in the same way that software applications have driven the early 2000s. In this episode Brian McMillan shares his work on the book "Building Data Products" and how he is working to educate business users and data professionals about the combination of technical, economical, and business considerations that need to be blended for these projects to succeed.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Brian McMillan about his work on the book "Building Data Products" and how he is bringing data professionals and business users into alignment for creating the systems that are necessary for organizations to succeed in the modern era.","date_published":"2022-01-15T14:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4bb1ad3e-d345-4b9f-90a2-5fcc186a3c7b.mp3","mime_type":"audio/mpeg","size_in_bytes":38343253,"duration_in_seconds":3013}]},{"id":"podlove-2022-01-08t02:24:38+00:00-8662db46a1e2ee4","title":"Open Source Reverse ETL For Everyone With Grouparoo","url":"https://www.dataengineeringpodcast.com/grouparoo-open-source-reverse-etl-episode-254","content_text":"Summary\nReverse ETL is a product category that evolved from the landscape of customer data platforms with a number of companies offering their own implementation of it. While struggling with the work of automating data integration workflows with marketing, sales, and support tools Brian Leonard accidentally discovered this need himself and turned it into the open source framework Grouparoo. In this episode he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. 
By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Brian Leonard about Grouparoo, an open source framework for managing your reverse ETL pipelines\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Grouparoo is and the story behind it?\nWhat are the core requirements for building a reverse ETL system?\n\nWhat are the additional capabilities that users of the system ask for as they get more advanced in their usage?\n\n\nWho is your target user for Grouparoo and how does that influence your priorities on feature development and UX design?\nWhat are the benefits of building an open source core for a reverse ETL platform as compared to the other commercial options?\nCan you describe the architecture and implementation of the Grouparoo project?\n\nWhat are the additional systems that you have built to support the hosted offering?\nHow have the design and goals of the project changed since you first started working on it?\n\n\nWhat is the workflow for getting Grouparoo deployed and set up with an initial pipeline?\nHow does Grouparoo handle model and schema evolution and potential mismatch in the data warehouse and destination systems?\nWhat is the process for building a new integration and getting it included in the official list of plugins?\nWhat is your strategy/philosophy around which features are included in the open source vs. hosted/enterprise offerings?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Grouparoo used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Grouparoo?\nWhen is Grouparoo the wrong choice?\nWhat do you have planned for the future of Grouparoo?\n\nContact Info\n\nLinkedIn\n@bleonard on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nGrouparoo\n\nGitHub\n\n\nTask Rabbit\nSnowflake\n\nPodcast Episode\n\n\nLooker\n\nPodcast Episode\n\n\nCustomer Data Platform\n\nPodcast Episode\n\n\ndbt\nOpen Source Data Stack Conference\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Reverse ETL is a product category that evolved from the landscape of customer data platforms with a number of companies offering their own implementation of it. While struggling with the work of automating data integration workflows with marketing, sales, and support tools Brian Leonard accidentally discovered this need himself and turned it into the open source framework Grouparoo. In this episode he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Brian Leonard about the open source reverse ETL framework Grouparoo and how you can start using it today.","date_published":"2022-01-07T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cd4a6072-2771-41f1-a439-373f2810253f.mp3","mime_type":"audio/mpeg","size_in_bytes":36156975,"duration_in_seconds":2696}]},{"id":"podlove-2022-01-08t01:26:42+00:00-d0cb3d4366c5aeb","title":"Data Observability Out Of The Box With Metaplane","url":"https://www.dataengineeringpodcast.com/metaplane-data-observability-platform-episode-253","content_text":"Summary\nData observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. 
Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Metaplane is and the story behind it?\nData observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?\n\nWhat are the areas of differentiation that you see across vendors in the space?\n\n\nCan you describe how the Metaplane platform is architected?\n\nHow have the design and goals of Metaplane changed or evolved since you started working on it?\n\n\nestablishing seasonality in data metrics\nblind spots from operating at the level of the data warehouse\nWhat are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaplane?\nWhen is Metaplane the wrong choice?\nWhat do you have planned for the future of Metaplane?\n\nContact Info\n\nLinkedIn\n@kevinzhenghu on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMetaplane\nDatadog\nControl Theory\nJames Clerk Maxwell\nCentrifugal Governor\nHuygens\nAmazon ECS\nStop Hiring Devops Experts (And Start Growing Them)\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Kevin Hu about his work on Metaplane to make implementing data observability practices as low friction as possible for data teams and organizations.","date_published":"2022-01-07T20:30:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/651543d8-946c-4902-b285-4bdf063811a9.mp3","mime_type":"audio/mpeg","size_in_bytes":42043462,"duration_in_seconds":3047}]},{"id":"podlove-2022-01-02t03:56:02+00:00-479ed0a4eaca43d","title":"Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary","url":"https://www.dataengineeringpodcast.com/controlled-vocabulary-with-dbtplyr-episode-252","content_text":"Summary\nCommunication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record in order to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.)\nWhat are some of the types of contracts that can, and should, be defined and enforced in data workflows?\n\nWhat are the boundaries where we should think about establishing those contracts?\n\n\nWhat is the utility of column and table names for defining and enforcing contracts in analytical work?\nWhat is the process for establishing contractual elements in a naming schema?\n\nWho should be involved in that design process?\nWho are the participants in the communication paths for column naming contracts?\n\n\nWhat are some examples of context and details that can’t be captured in column names?\n\nWhat are some options for managing that additional information and linking it to the naming contracts?\n\n\nCan you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects?\n\nHow does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows\nHow are you using dbtplyr in your own work?\n\n\nHow do you handle the work of building transformations to make data comply with contracts?\nWhat are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers?\nWhat are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr?\nWhen is dbtplyr the wrong choice?\nWhat do you have planned for the future of dbtplyr?\n\nContact Info\n\nTwitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\ndbtplyr\nGreat Expectations\n\nPodcast Episode\n\n\nControlled Vocabularies Presentation\ndplyr\nData Vault\n\nPodcast Episode\n\n\nOpenMetadata\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record in order to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Emily Riederer about the work of establishing a controlled vocabulary for building a shared context in your data warehouse to reduce communication overhead.","date_published":"2022-01-01T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e514b67f-1d62-4c9f-acf1-e77fb8c7ddd9.mp3","mime_type":"audio/mpeg","size_in_bytes":51223395,"duration_in_seconds":3634}]},{"id":"podlove-2022-01-02t03:20:29+00:00-912ac99161648b1","title":"A Reflection On The Data Ecosystem For The Year 2021","url":"https://www.dataengineeringpodcast.com/data-ecosystem-year-in-review-2021-episode-251","content_text":"Summary\nThis has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thought son the year to come.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year\n\nInterview\n\n\nIntroduction\n\n\nHow did you get involved in the area of data management?\n\n\nWhat were the main themes that you saw data practitioners and vendors focused on this year?\n\n\nWhat is the major bottleneck for Data teams in 2021? Will it be the same in 2022?\nOne of the ways to reason about progress in any domain is to look at what was the primary bottleneck of further progress (data adoption for decision making) at different points in time. In the data domain, we have seen a number of bottlenecks, for example, scaling data platforms, the answer to which was Hadoop and on-prem columnar stores and then cloud data warehouses such as Snowflake & BigQuery. Then the problem was data integration and transformation which was solved by data integration vendors and frameworks such as Fivetran / Airbyte, modern orchestration frameworks such as Dagster & dbt and “reverse-ETL” Hightouch. What is the main challenge now?\n\n\nWill SQL be challenged as a primary interface to analytical data?\nIn 2020 we’ve seen a few launches of post-SQL languages such as Malloy, Preql, metric layer query languages from Transform and Supergrain.\n\n\nTo what extent does speed matter?\nOver the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move?\n\n\nHow has the way data teams work been changing?\nIn 2020 remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders?\n\n\nWhat’s it like to be a data vendor in 2021?\n\n\nVertically integrated vs. modular data stack?\nThere are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack?\n\n\nContact Info\n\nMaura\n\nLinkedIn\nWebsite\n@outoftheverse on Twitter\n\n\nDavid\n\nLinkedIn\n@davidjwallace on Twitter\ndwallace0723 on GitHub\n\n\nBenn\n\nLinkedIn\n@bennstancil on Twitter\n\n\nGleb\n\nLinkedIn\n@glebmm on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nPatreon\nDutchie\nMode Analytics\nDatafold\n\nPodcast Episode\n\n\nLocally Optimistic\nRJ Metrics\nStitch\nMozart Data\n\nPodcast Episode\n\n\nDagster\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A wide ranging conversation among a panel of data professionals about their view on the past year's trends in the data management and analytics ecosystem and what we might expect for the year to come.","date_published":"2022-01-01T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2e372743-a5ab-4d8c-8374-b8e05f799ed9.mp3","mime_type":"audio/mpeg","size_in_bytes":49414015,"duration_in_seconds":3809}]},{"id":"podlove-2021-12-26t19:55:18+00:00-44c213bea28074a","title":"Exploring The Evolving Role Of Data Engineers","url":"https://www.dataengineeringpodcast.com/redefining-data-engineering-episode-249","content_text":"Summary\nData Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is your current working definition of a data engineer?\n\nHow has that definition changed since your article on the \"rise of the data engineer\" and episode 3 of this show about \"defining data engineering\"?\n\n\nHow has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?\n\nHow should a new/aspiring data engineer focus their time and energy to become effective?\n\n\nOne of the core themes in this current spate of technologies is \"democratization of data\". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control with so many contributors with varying levels of skill and understanding. How well is the \"modern data stack\" balancing these concerns?\nAn interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?\nWith the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack?\n\nThere is also a lot of conversation around the \"modern data stack\", as well as the need for companies to build a \"data platform\". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform?\n\n\nHow do you view the long term viability of templated SQL as a core workflow for transformations?\nWhat is the impact of more acessible and widespread machine learning/deep learning on data engineers/data infrastructure?\nHow evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices?\nWhat are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the data ecosytem evolve?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem?\nIn episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. 
What is your retrospective on those claims, and what are your new predictions for the upcoming years?\n\nContact Info\n\nLinkedIn\n@mistercrunch on Twitter\nmistercrunch on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nHow the Modern Data Stack is Reshaping Data Engineering\nThe Rise of the Data Engineer\nThe Downfall of the Data Engineer\nDefining Data Engineering – Data Engineering Podcast\nAirflow\nSuperset\n\nPodcast Episode\n\n\nPreset\nFivetran\n\nPodcast Episode\n\n\nMeltano\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nRalph Kimball\nBill Inmon\nFeature Store\nProphecy.io\n\nPodcast Episode\n\n\nAb Initio\nDremio\n\nPodcast Episode\n\n\nData Mesh\n\nPodcast Episode\n\n\nFirebolt\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Maxime Beauchemin about how the technological progression in the data ecosystem is driving a constant change in the role and responsibilities of data engineers.","date_published":"2021-12-26T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d7e8fc72-3640-4ef3-bca3-496c83c2a2d3.mp3","mime_type":"audio/mpeg","size_in_bytes":44846842,"duration_in_seconds":3461}]},{"id":"podlove-2021-12-26t20:05:23+00:00-68e6e4b69d92de9","title":"Revisiting The Technical And Social Benefits Of The Data Mesh","url":"https://www.dataengineeringpodcast.com/data-mesh-revisited-episode-250","content_text":"Summary\nThe data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural patterns in 2019, and since then it has been gaining popularity with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving a brief recap of the principles of the data mesh and the story behind it?\nHow has your view of the principles of the data mesh changed since our conversation in July of 2019?\nWhat are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?\nWhat do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?\nIn the opening of your book you state that \"Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale\". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?\nWhat are some of the ways that data mesh concepts manifest at the boundaries of organizations?\nWhile the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?\nWhat are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?\nWhat are the technological components that are still missing for a mesh-native data platform?\nHow should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?\n\nHow can vendors factor into the solution?\n\n\nWhat is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?\nWhen is a data mesh the wrong approach?\nWhat do you think the future of the data mesh will look like?\n\nContact Info\n\nLinkedIn\n@zhamakd on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nData Engineering Podcast Data Mesh Interview\nData Mesh Book\nThoughtworks\nExpert Systems\nOpenLineage\n\nPodcast Episode\n\n\nData Mesh Learning\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Zhamak Dehghani about her experience working with the community that has grown up around her idea of the data mesh and the lessons that she has learned.","date_published":"2021-12-26T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/94359ff9-9582-4eff-8643-6be94d01a4df.mp3","mime_type":"audio/mpeg","size_in_bytes":54470972,"duration_in_seconds":4253}]},{"id":"podlove-2021-12-21t14:14:04+00:00-8da2677c1ad0382","title":"Fast And Flexible Headless Data Analytics With Cube.JS","url":"https://www.dataengineeringpodcast.com/cubejs-open-source-headless-data-analytics-episode-248","content_text":"Summary\nOne of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js a framework for building analytics APIs to power your applications and BI dashboards\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Cube is and the story behind it?\nWhat are the main use cases and platform architectures that you are focused on?\n\nWho are the target personas that will be using and managing Cube.js?\n\n\nThe name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem?\n\nHow does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer?\n\n\nWhat are the pieces of a data platform that might be replaced by Cube.js?\nCan you describe the design and architecture of the Cube platform?\n\nHow has the focus and target use case for the Cube platform evolved since you first started working on it?\n\n\nOne of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework?\nWhat is your overarching design philosophy for the API of the Cube system?\nCan you talk through the workflow of someone building a cube and querying it from a downstream system?\n\nWhat do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js?\n\n\nWhat are some of the data modeling steps that are needed in the source systems?\nThe perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube?\nWhat are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js?\n\nWhat are the opportunities for composing multiple cubes together to form a higher level aggregation?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Cube.js used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?\nWhen is Cube the wrong choice?\nWhat do you have planned for the future of Cube?\n\nContact Info\n\nArtom\n\nkeydunov on GitHub\n@keydunov on Twitter\nLinkedIn\n\n\nPavel\n\nLinkedIn\n@paveltiunov87 on Twitter\npaveltiunov on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nCube.js\nStatsbot\nchart.js\nHighcharts\nD3\nOLAP Cube\ndbt\nSuperset\n\nPodcast Episode\n\n\nStreamlit\n\nPodcast.__init__ Episode\n\n\nParquet\nHasura\nkSQLDB\n\nPodcast Episode\n\n\nMaterialize\n\nPodcast Episode\n\n\nMeltano\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.

","summary":"An interview with the creators of Cube.JS about their work to build an open source framework for performant OLAP queries delivered through web and SQL APIs.","date_published":"2021-12-21T09:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6196b5cf-1741-4d4e-8f82-0da0c4d60fa8.mp3","mime_type":"audio/mpeg","size_in_bytes":51471243,"duration_in_seconds":3283}]},{"id":"podlove-2021-12-20t00:05:43+00:00-da0f95a166a69ec","title":"Building A System Of Record For Your Organization's Data Ecosystem At Metaphor","url":"https://www.dataengineeringpodcast.com/metaphor-metadata-system-of-record-episode-247","content_text":"Summary\nBuilding a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Metaphor is and the story behind it?\nOn your site it states that you are aiming to be the \"system of record\" for your data platform. Can you unpack that statement and its implications?\n\nWhat are the shortcomings in the \"data catalog\" approach to metadata collection and presentation?\n\n\nWho are the target end users of Metaphor and what are the pain points for each persona that you are prioritizing?\n\nHow has that focus informed your priorities for user experience design and feature development?\n\n\nCan you describe how the Metaphor platform is architected?\n\nWhat are the lessons that you learned from your work at DataHub that have informed your work on Metaphor?\n\n\nThere has been a huge amount of focus on the \"modern data stack\" with an assumption that there is a cloud data warehouse as the central component that all data flows through. How does Metaphor’s design allow for usage in platforms that aren’t dominated by a cloud data warehouse?\nWhat are some examples of information that you can extract through integrations with an organization’s communication platforms?\n\nCan you talk through a few example workflows where that information is used to inform the actions taken by a team member?\n\n\nWhat is your philosophy around data modeling or schema standardization for metadata records?\n\nWhat are some of the challenges that teams face in stitching together a meaningful set of relations across metadata records in Metaphor?\n\n\nWhat are some of the features or potential use cases for Metaphor that are overlooked or misunderstood as you work with your customers?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Metaphor used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaphor?\nWhen is Metaphor the wrong choice?\nWhat do you have planned for the future of Metaphor?\n\nContact Info\n\nPardhu\n\nLinkedIn\n@PardhuGunnam on Twitter\n\n\nMars\n\nLinkedIn\nmars-lan on GitHub\n@mars_lan on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMetaphor\n\nThe Modern Metadata Platform\nWhy cant I find the right data?\n\n\nDataHub\nTransform\n\nPodcast Episode\n\n\nSupergrain\nMetriQL\n\nPodcast Episode\n\n\ndbt\n\nPodcast Interview\n\n\nOpenMetadata\n\nPodcast Interview\n\n\nPegasus Data Language\nModern Data Experience\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building a well-managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor is building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows.

","summary":"An interview with the founders of Metaphor about their work to build a system of record for all of the data in your organization that bridges the technical and social requirements of your teams.","date_published":"2021-12-19T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a6f5d7d6-c002-43c9-8309-ad38a486d2cd.mp3","mime_type":"audio/mpeg","size_in_bytes":49877686,"duration_in_seconds":3933}]},{"id":"podlove-2021-12-12t02:58:57+00:00-6e2ea04f868e975","title":"Building Auditable Spark Pipelines At Capital One","url":"https://www.dataengineeringpodcast.com/spark-data-enrichment-capital-one-episode-246","content_text":"Summary\nSpark is a powerful and battle tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares his use for it in calculating your rewards points, including the auditing requirements and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the types of data and workflows that you are responsible for at Capital one?\n\nIn terms of the three \"V\"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with?\n\n\nWhat are some of the business and regulatory requirements that have to be factored into the solutions that you design?\nWho are the consumers of the data assets that you are producing?\nCan you describe the technical elements of the platform that you use for managing your data pipelines?\nWhat are the various ways that you are using Spark at Capital One?\nYou wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation?\n\nWhat were the shortcomings to that approach/business requirements which led you to refactoring the approach to one that maintained all of the data through the different processing stages?\n\n\nWhat are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps?\nWhat are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post?\nWhat are some of the limitations of Spark that you have experienced during your work at Capital One?\n\nHow have you worked around them?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One?\nWhat are some of the upcoming projects that you are focused on/excited for?\n\nHow has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on?\n\n\n\nContact Info\n\n@gocool_p on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nApache Spark\nBlog Post\nDatabricks Presentation\nDelta Lake\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Spark is a powerful and battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate rewards points, including the auditing requirements and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.

","summary":"An interview with Capital One engineer Gokul Prabagaren about his work on building Spark workflows with a data enrichment approach to provide auditable data transformations.","date_published":"2021-12-12T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/420d9f89-083d-49ee-b4a3-2e2d8bc21127.mp3","mime_type":"audio/mpeg","size_in_bytes":33957986,"duration_in_seconds":2529}]},{"id":"podlove-2021-12-12t01:26:38+00:00-d29828e305b651e","title":"Deliver Personal Experiences In Your Applications With The Unomi Open Source Customer Data Platform","url":"https://www.dataengineeringpodcast.com/unomi-customer-data-platform-episode-245","content_text":"Summary\nThe core to providing your users with excellent service is to understand them and provide a personalized experience. Unfortunately many sites and applications take that to the extreme and collect too much information. In order to make it easier for developers to build customer profiles in a way that respects their privacy Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. 
Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Serge Huber about Apache Unomi, an open source customer data platform designed to manage customers, leads and visitors data and help personalize customers experiences\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Unomi is and the story behind it?\nWhat are the goals and target use cases of Unomi?\nWhat are the aspects of collecting and aggregating profile information that present challenges to developers?\n\nHow does the design of Unomi reduce that burden?\n\n\nHow does the focus of Unomi compare to systems such as Segment/Rudderstack or Optimizely for collecting user interactions and applying personalization?\nHow does Unomi fit in the architecture of an application or data infrastructure?\nCan you describe how Unomi itself is architected?\n\nHow have the goals and design of the project changed or evolved since it started?\nWhat are some of the most complex or challenging engineering projects that you have worked through?\n\n\nCan you describe the workflow of using Unomi to manage a set of customer profiles?\nWhat are some examples of user experience customization that you can build with Unomi?\n\nWhat are some alternative architectures that you have seen to produce similar capabilities?\n\n\nOne of the interesting features of Unomi is the end-user profile management. What are some of the system and developer challenges that are introduced by that capability? (e.g. constraints on data manipulation, security, privacy concerns, etc.)\nHow did Unomi manage privacy concerns and the GDPR ?\nHow does Unomi help with the new third party data restrictions ?\nWhy is access to raw data so important ?\nCould cloud providers offer Unomi as a service ?\nHow have you used Unomi in your own work?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Unomi used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Unomi?\nWhen is Unomi the wrong choice?\nWhat do you have planned for the future of Unomi?\n\nContact Info\n\nLinkedIn\n@sergehuber on Twitter\n@bhillou on Twitter\nsergehuber on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nApache Unomi\nJahia\nOASIS Open Foundation\nSegment\n\nPodcast Episode\n\n\nRudderstack\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The key to providing your users with excellent service is understanding them and providing a personalized experience. Unfortunately many sites and applications take that to the extreme and collect too much information. In order to make it easier for developers to build customer profiles in a way that respects their privacy, Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem.

","summary":"An interview about the open source Unomi framework for building a customer data platform and how it can provide a personalized experience to your audience.","date_published":"2021-12-11T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/55658743-d527-45d8-8c81-1a0c42d4e952.mp3","mime_type":"audio/mpeg","size_in_bytes":47806343,"duration_in_seconds":3453}]},{"id":"podlove-2021-12-04t12:11:42+00:00-73e0f9c19aede4f","title":"Data Driven Hiring For Data Professionals With Alooba","url":"https://www.dataengineeringpodcast.com/alooba-hiring-data-professionals-episode-243","content_text":"Summary\nHiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data oriented roles, and how it can also provide visibility into your organizations overall data literacy. The whole process of hiring is an important organizational skill to cultivate and this is an interesting exploration of the specific challenges involved in finding data professionals.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. 
Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Tim Freestone about Alooba, an assessment platform for evaluating data and analytics candidates to improve hiring outcomes for data roles.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Alooba is and the story behind it?\nWhat are the main goals that you are trying to achieve with Alooba?\nWhat are the main challenges that employers and candidates face when navigating their respective roles in the hiring process?\n\nWhat are some of the difficulties that are specific to data oriented roles?\n\n\nWhat are some of the complexities involved in designing a user experience that is positive and productive for both candidates and companies?\nWhat are some strategies that you have developed for establishing a fair and consistent baseline of skills to ensure consistent comparison across candidates?\nOne of the problems that comes from test-based skills assessment is the implicit bias toward candidates who test well. How do you work to mitigate that in the candidate evaluation process?\nCan you describe how the Alooba platform itself is implemented?\n\nHow have the goals and design of the system changed or evolved since you first started it?\nWhat are some of the ways that you use Alooba internally?\n\n\nHow do you stay up to date with the evolving skill requirements as roles change and new roles are created?\nBeyond evaluation of candidates for hiring, what are some of the other features that you have added to Alooba to support organizations in their effort to gain value from their data?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Alooba used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Alooba?\nWhen is Alooba the wrong choice?\nWhat do you have planned for the future of Alooba?\n\nContact Info\n\nLinkedIn\n@timmyfreestone on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAlooba\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Hiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data-oriented roles, and how it can also provide visibility into your organization’s overall data literacy. The whole process of hiring is an important organizational skill to cultivate, and this is an interesting exploration of the specific challenges involved in finding data professionals.

","summary":"An interview with Alooba founder Tim Freestone about the challenges of interviewing data professionals and how he is working to provide a more detailed view of candidates abilities through high quality skills assessments.","date_published":"2021-12-04T07:15:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0157e3fb-c717-4fef-ab9c-ccec58713d7d.mp3","mime_type":"audio/mpeg","size_in_bytes":39410915,"duration_in_seconds":3002}]},{"id":"podlove-2021-12-04t12:16:48+00:00-87559b6cb8f514e","title":"Experimentation and A/B Testing For Modern Data Teams With Eppo","url":"https://www.dataengineeringpodcast.com/eppo-experimentation-ab-testing-episode-244","content_text":"Summary\nA/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. 
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Eppo is and the story behind it?\nWhat are some examples of the kinds of experiments that teams and organizations might want to conduct?\nWhat are the points of friction that\nWhat are the steps involved in designing, deploying, and analyzing the outcomes of an A/B experiment?\n\nWhat are some of the statistical errors that are common when conducting an experiment?\n\n\nWhat are the design and UX principles that you have focused on in Eppo to improve the workflow of building and analyzing experiments?\nCan you describe the system design of the Eppo platform?\n\nWhat are the services or capabilities external to Eppo that are required for it to be effective?\nWhat are the integration points for adding Eppo to an organization’s existing platform?\n\n\nBeyond the technical capabilities for running experiments there are a number of design requirements involved. Can you talk through some of the decisions that need to be made when deciding what to change and how to measure its impact?\nAnother difficult element of managing experiments is understanding how they all interact with each other when running a large number of simultaneous tests. How does Eppo help with tracking the various experiments and the cohorts that are bucketed into each?\nWhat are some of the ideas or assumptions that you had about the technical and design aspects of running experiments that have been challenged or changed while building Eppo?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Eppo used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Eppo?\nWhen is Eppo the wrong choice?\nWhat do you have planned for the future of Eppo?\n\nContact Info\n\nLinkedIn\n@chesharma87 on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nEppo\nKnowledge Repo\nApache Hive\nFrequentist Statistics\nRudderstack\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals.

","summary":"An interview with Eppo founder Chetan Sharma about the challenges of designing, running, and analyzing product experiments and the work that he is doing to make it more accessible to organizations of every size.","date_published":"2021-12-04T07:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b8c8c56d-1af7-4109-bc88-9cb10112483b.mp3","mime_type":"audio/mpeg","size_in_bytes":52350922,"duration_in_seconds":3480}]},{"id":"podlove-2021-11-27t20:26:26+00:00-640dedb61c091e4","title":"Creating A Unified Experience For The Modern Data Stack At Mozart Data","url":"https://www.dataengineeringpodcast.com/mozart-data-modern-data-stack-episode-242","content_text":"Summary\nThe modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy to use option for getting started with the modern data stack. In this episode they explain how they designed a user experience to make working with data more accessibly by organizations without a data team, while allowing for more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. 
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Peter Fishman and Dan Silberman about Mozart Data and how they are building a unified experience for the modern data stack\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Mozart Data is and the story behind it?\nThe promise of the \"modern data stack\" is that it’s all delivered as a service to make it easier to set up. What are the missing pieces that make something like Mozart necessary?\nWhat are the main workflows or industries that you are focusing on?\nWho are the main personas that you are building Mozart for?\n\nHow has that combination of user persona and industry focus informed your decisions around feature priorities and user experience?\n\n\nCan you describe how you have architected the Mozart platform?\n\nHow have you approached the build vs. buy decision internally?\nWhat are some of the most interesting or challenging engineering projects that you have had to work on while building Mozart?\n\n\nWhat are the stages of the data lifecycle that you work the hardest to automate, and which do you focus on exposing to customers?\nWhat are the edge cases in what customers might try to do in the bounds of Mozart, or areas where you have explicitly decided not to include in your features?\n\nWhat are the options for extensibility, or custom engineering when customers encounter those situations?\n\n\nWhat do you see as the next phase in the evolution of the data stack?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Mozart used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Mozart?\nWhen is Mozart the wrong choice?\nWhat do you have planned for the future of Mozart?\n\nContact Info\n\nPeter\n\nLinkedIn\n@peterfishman on Twitter\n\n\nDan\n\nLinkedIn\nsilberman on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nMozart Data\nModern Data Stack\nMode Analytics\nFivetran\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy to use option for getting started with the modern data stack. In this episode they explain how they designed a user experience to make working with data more accessible to organizations without a data team, while allowing more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.

","summary":"An interview with Peter Fishman and Dan Silberman about how they are working to reduce the effort involved in setting up and integrating the various components of the modern data stack at Mozart Data.","date_published":"2021-11-27T15:30:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f1ecdc48-f0f9-47d0-bd8a-ede29dfc1fdb.mp3","mime_type":"audio/mpeg","size_in_bytes":50150388,"duration_in_seconds":3511}]},{"id":"podlove-2021-11-27t11:43:36+00:00-14e50fff919a3b4","title":"Doing DataOps For External Data Sources As A Service at Demyst","url":"https://www.dataengineeringpodcast.com/demyst-external-dataops-platform-episode-241","content_text":"Summary\nThe data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? 
Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Mark Hookey about Demyst Data, a platform for operationalizing external data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Demyst is and the story behind it?\n\nWhat are the services and systems that you provide for organizations to incorporate external sources in their data workflows?\nWho are your target customers?\n\n\nWhat are some examples of data sets that an organization might want to use in their analytics?\n\nHow are these different from SaaS data that an organization might integrate with tools such as Stitcher and Fivetran?\n\n\nWhat are some of the challenges that are introduced by working with these external data sets?\n\nIf an organization isn’t using Demyst what are some of the technical and organizational systems that they will need to build and manage?\n\n\nCan you describe how the Demyst platform is architected?\n\nWhat have been the most complex or difficult engineering challenges that you have dealt with while building Demyst?\n\n\nGiven the wide variance in the systems that your customers are running, what are some strategies that you have used to provide flexible APIs for accessing the underlying information?\nWhat is the process for you to identify and onboard a new data source in your platform?\nWhat are some of the additional analytical systems that you have to run to manage your business (e.g. usage metering and analytics, etc.)?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Demyst used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Demyst?\nWhen is Demyst the wrong choice?\nWhat do you have planned for the future of Demyst?\n\nContact Info\n\nLinkedIn\nEmail\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nDemyst Data\nLexisNexis\nAWS Athena\nDataRobot\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.

","summary":"An interview with Demyst founder Mark Hookey about the use cases for external data sources and how they have built a DataOps platform to provide third party data sets as a service.","date_published":"2021-11-27T15:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0f8fdfb5-0e94-43b4-a474-d4a616c86c63.mp3","mime_type":"audio/mpeg","size_in_bytes":43280473,"duration_in_seconds":3556}]},{"id":"podlove-2021-11-20t11:11:20+00:00-c7e24d72c774bdc","title":"Laying The Foundation Of Your Data Platform For The Era Of Big Complexity With Dagster","url":"https://www.dataengineeringpodcast.com/dagster-data-platform-big-complexity-episode-239","content_text":"Summary\nThe technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform and blazing fast NVMe storage there’s nothing slowing you down. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Nick Schrock about the evolution of Dagster and its path forward\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Dagster is and the story behind it?\nHow has the project and community changed/evolved since we last spoke 2 years ago?\n\nHow has the experience of the past 2 years clarified the challenges and opportunities that exist in the data ecosystem?\n\nWhat do you see as the foundational vs transient complexities that are germane to the industry?\n\n\n\n\nOne of the emerging ideas in Dagster is the \"software defined data asset\" as the central entity in the framework. How has that shifted the way that engineers approach pipeline design and composition?\n\nHow did that conceptual shift inform the accompanying refactor of the core principles in the framework? (jobs, ops, graphs)\n\n\nOne of the powerful elements of the Dagster framework is the investment in rich metadata as a foundational principle. What are the opportunities for integrating and extending that context throughout the rest of an organizations data platform?\n\nWhat do you see as the potential for efforts such as OpenLineage and OpenMetadata to allow for other components in the data platform to create and propagate that context more freely?\n\n\nWhat are some of the project architecture/repository structure/pipeline composition patterns that have begun to form in the community and your own internal work with Dagster?\n\nWhat are some of the anti-patterns that you have seen users fall into when working with Dagster?\n\n\nAlong with your recent refactoring of the core API you have also started to roll out the Dagster Cloud offering. What was your process for determining the path to commercialization for the Dagster project and community?\n\nHow are you managing governance and long-term viability of the open source elements of Dagster?\nWhat are your design principles for deciding the boundaries between OSS and commercial features?\n\n\nWhat do you see as the role of Dagster in the creation of a data platform architecture?\n\nWhat are the opportunities that it creates for data platform engineers?\n\n\nWhat is your perspective on the tradeoffs of pipelines as software vs. pipelines as \"code\" vs. low/no-code pipelines?\n\nWhat (if any) option do you see for language agnostic/multi-language pipeline definitions in Dagster?\n\n\nWhat do you see as the biggest threats to the future success of Dagster/Elementl?\nYou were a relative outsider to the data ecosystem when you first started Dagster/Elementl. What have been the most interesting and surprising experiences as you have invested your time and energy in contributing to the community?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Dagster used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Dagster?\nWhen is Dagster the wrong choice?\nWhat do you have planned for the future of Dagster?\n\nContact Info\n\nLinkedIn\n@schrockn on Twitter\nschrockn on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nElementl\n\nSeries A Announcement\n\n\nVideo on software-defined assets\nDagster\n\nPodcast Episode\n\n\nGraphQL\ndbt\n\nPodcast Episode\n\n\nOpen Source Data Stack Conference\nMeltano\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nDataHub\n\nPodcast Episode\n\n\nHashicorp\nVercel\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Nick Schrock about how the Dagster framework is focusing on taming the complexity of data workflows, the introduction of Dagster Cloud for reducing the operational burden, and his philosophy on the boundaries for commercial and open source features going forward.","date_published":"2021-11-20T06:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cec44fef-5452-445b-9bc5-516a625160fe.mp3","mime_type":"audio/mpeg","size_in_bytes":50246187,"duration_in_seconds":3925}]},{"id":"podlove-2021-11-20t11:35:42+00:00-688a436eda38aa8","title":"Exploring Processing Patterns For Streaming Data Integration In Your Data Lake","url":"https://www.dataengineeringpodcast.com/upsolver-streaming-data-integration-episode-240","content_text":"Summary\nOne of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch oriented mindset.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the state of the market for data lakes today?\n\nWhat are the prevailing architectural and technological patterns that are being used to manage these systems?\n\n\nBatch and streaming systems have been used in various combinations since the early days of Hadoop. The Lambda architecture has largely been abandoned, so what is the answer for today’s data lakes?\nWhat are the challenges presented by streaming approaches to data transformations?\n\nThe batch model for processing is intuitive despite its latency problems. What are the benefits that it provides?\n\n\nThe core concept for data orchestration is the DAG. How does that manifest in a streaming context?\nIn batch processing idempotent/immutable datasets are created by re-running the entire pipeline when logic changes need to be made. Given that there is no definitive start or end of a stream, what are the options for amending logical errors in transformations?\nWhat are some of the data processing/integration patterns that are impossible in a batch system?\nWhat are some useful strategies for migrating from a purely batch, or hybrid batch and streaming architecture, to a purely streaming system?\n\nWhat are some of the changes in technological or organizational patterns that are often overlooked or misunderstood in this shift?\n\n\nWhat are some of the most surprising things that you have learned about streaming systems in your time at Upsolver?\nWhat are the most interesting, innovative, or unexpected ways that you have seen streaming architectures used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on streaming data integration?\nWhen are streaming architectures the wrong approach?\nWhat do you have planned for the future of Upsolver to make streaming data easier to work with?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nUpsolver\nHive Metastore\nHudi\n\nPodcast Episode\n\n\nIceberg\n\nPodcast Episode\n\n\nHadoop\nLambda Architecture\nKappa Architecture\nApache Beam\nEvent Sourcing\nFlink\n\nPodcast Episode\n\n\nSpark Structured Streaming\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch oriented mindset.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Ori Rafael about the benefits of streaming data integration for data lake analytics and how to design your pipelines when migrating from a batch oriented mindset.","date_published":"2021-11-20T06:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4a096e03-34ab-4360-b06f-0e7ab73ce613.mp3","mime_type":"audio/mpeg","size_in_bytes":42658474,"duration_in_seconds":3173}]},{"id":"podlove-2021-11-14t12:24:09+00:00-20861518621a471","title":"Data Quality Starts At The Source","url":"https://www.dataengineeringpodcast.com/databand-proactive-data-quality-episode-238","content_text":"Summary\nThe most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust it is necessary to invest in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality and starting with the source, rather than being reactive and having to work backwards from when a problem is found.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Michael Harper about definitions of data quality and where to define and enforce it in the data platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is your definition for the term \"data quality\" and what are the implied goals that it embodies?\n\nWhat are some ways that different stakeholders and participants in the data lifecycle might disagree about the definitions and manifestations of data quality?\n\n\nThe market for \"data quality tools\" has been growing and gaining attention recently. How would you categorize the different approaches taken by open source and commercial options in the ecosystem?\n\nWhat are the tradeoffs that you see in each approach? (e.g. data warehouse as a chokepoint vs quality checks on extract)\n\n\nWhat are the difficulties that engineers and stakeholders encounter when identifying and defining information that is necessary to identify issues in their workflows?\nCan you describe some examples of adding data quality checks to the beginning stages of a data workflow and the kinds of issues that can be identified?\n\nWhat are some ways that quality and observability metrics can be aggregated across multiple pipeline stages to identify more complex issues?\n\n\nIn application observability the metrics across multiple processes are often associated with a given service. What is the equivalent concept in data platform observabiliity?\nIn your work at Databand what are some of the ways that your ideas and assumptions around data quality have been challenged or changed?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Databand used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working at Databand?\nWhen is Databand the wrong choice?\nWhat do you have planned for the future of Databand?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDataband\nClean Architecture (affiliate link)\nGreat Expectations\nDeequ\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust it is necessary to invest in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality and starting with the source, rather than being reactive and having to work backwards from when a problem is found.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Michael Harper about the benefits of being proactive about data quality efforts and building expectations and metrics into every stage of your pipelines, from source to destination.","date_published":"2021-11-14T07:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b48e6f00-2093-46c6-851c-0c92ee723acf.mp3","mime_type":"audio/mpeg","size_in_bytes":43519730,"duration_in_seconds":3534}]},{"id":"podlove-2021-11-10t22:51:27+00:00-39d928088059cd5","title":"Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata","url":"https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237","content_text":"Summary\nA significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadat project is aiming to be a common standard for defining and storing metadata for every use case in data platforms and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. 
The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Sriharsha Chintalapani and Suresh Srinivas about OpenMetadata, an open standard for metadata and a reference implementation for a central metadata store\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what the OpenMetadata project is and the story behind it?\n\nWhat are the goals of the project?\n\n\nWhat are the common challenges faced by engineers and data practitioners in organizing the metadata for their systems?\nWhat are the capabilities that a centralized and holistic view of a platform’s metadata can enable?\nHow would you characterize the current state and progress on the open source initiative around OpenMetadata?\nHow does OpenMetadata compare to the OpenLineage project and other similar systems?\n\nWhat opportunities do you see for collaborating with or learning from their efforts?\n\n\nWhat are the schema elements that you have identified as critical to a holistic view of an organization’s metadata?\nFor an organization with an existing data platform, what is the role that OpenMetadata plays, and what are the points of integration across the different components?\nCan you describe the implementation of the OpenMetadata architecture?\n\nWhat are the user experience and operational characteristics that you are trying to optimize for as you iterate on the project?\n\n\nWhat are the challenges that you face in balancing the generality and specificity of the core schemas for metadata objects?\nThere are a large and growing number of businesses that create systems on top of an organizations metadata in the form of catalogs, observability, governance, data quality, etc. What do you see as the role of the OpenMetadata project across that ecosystem of products?\nHow has your perspective on the domain of metadata management and the associated challenges changed or evolved as you have been working on this project?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata?\nWhen is OpenMetadata the wrong choice?\nWhat do you have planned for the future of OpenMetadata?\n\nContact Info\n\nSuresh\n\nLinkedIn\n@suresh_m_s on Twitter\nsureshms on GitHub\n\n\nSriharsha\n\nLinkedIn\nharshach on GitHub\n@d3fmacro on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nOpenMetadata\nApache Storm\nApache Kafka\nHortonworks\nApache Atlas\nOpenMetadata Sandbox\nOpenLineage\n\nPodcast Episode\n\n\nEgeria\nJSON Schema\nAmundsen\n\nPodcast Episode\n\n\nDataHub\n\nPodcast Episode\n\n\nJanusGraph\nTitan Graph Database\nHBase\nJetty\nDropWizard\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadata project is aiming to be a common standard for defining and storing metadata for every use case in data platforms and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the OpenMetadata project and how it can provide a universal metadata layer for your whole data environment through common schema definitions and a simple architecture","date_published":"2021-11-10T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/212e85a0-f13a-49e6-88fb-2985cd56515f.mp3","mime_type":"audio/mpeg","size_in_bytes":54850339,"duration_in_seconds":4014}]},{"id":"podlove-2021-11-06t21:21:43+00:00-799e689a702d8bd","title":"Business Intelligence Beyond The Dashboard With ClicData","url":"https://www.dataengineeringpodcast.com/clicdata-cloud-business-intelligence-episode-236","content_text":"Summary\nBusiness intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an organization. What is overlooked in that characterization is the level of complexity and effort that are required to collect and present that information, and the opportunities for providing those insights in other contexts. In this episode Telmo Silva explains how he co-founded ClicData to bring full featured business intelligence and reporting to every organization without having to build and maintain that capability on their own. This is a great conversation about the technical and organizational operations involved in building a comprehensive business intelligence system and the current state of the market.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. 
By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Telmo Silva about ClicData,\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what ClicData is and the story behind it?\nHow would you characterize the current state of the market for business intelligence?\n\nWhat are the systems/capabilities that are required to run a full-featured BI system?\n\n\nWhat are the challenges that businesses face in developing in-house capacity for business intelligence?\nCan you describe how the ClicData platform is architected?\n\nHow has it changed or evolved since you first began working on it?\n\n\nHow are you approaching schema design and evolution in the storage layer?\nHow do you handle questions of data security/privacy/regulations given that you are storing the information on behalf of the business?\nIn your work with clients what are some of the challenges that businesses are facing when attempting to answer questions and gain insights from their data in a repeatable fashion?\n\nWhat are some strategies that you have found useful for structuring schemas or dashboards to make iterative exploration of data effective?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen ClicData used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on ClicData?\nWhen is ClicData the wrong choice?\nWhat do you have planned for the future of ClicData?\n\nContact Info\n\nLinkedIn\n@telmo_clicdata on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nClicData\nTableau\nSuperset\n\nPodcast Episode\n\n\nPentaho\nD3.js\nInformatica\nTalend\nTIBCO Spotfire\nLooker\n\nPodcast Episode\n\n\nBullet Chart\nPostgreSQL\n\nPodcast Episode\n\n\nAzure\nCrystal Reports\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Business intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an organization. What is overlooked in that characterization is the level of complexity and effort that are required to collect and present that information, and the opportunities for providing those insights in other contexts. In this episode Telmo Silva explains how he co-founded ClicData to bring full featured business intelligence and reporting to every organization without having to build and maintain that capability on their own. This is a great conversation about the technical and organizational operations involved in building a comprehensive business intelligence system and the current state of the market.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Telmo Silva about all of the layers involved in a full featured business intelligence system and how he created ClicData to make them available to organizations of every size.","date_published":"2021-11-06T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7b3dd6bd-0480-4b04-bf15-27fa0007b8f4.mp3","mime_type":"audio/mpeg","size_in_bytes":44034319,"duration_in_seconds":3720}]},{"id":"podlove-2021-11-05t01:50:07+00:00-387ab86db24371e","title":"Exploring The Evolution And Adoption of Customer Data Platforms and Reverse ETL","url":"https://www.dataengineeringpodcast.com/reverse-etl-and-customer-data-platforms-episode-235","content_text":"Summary\nThe precursor to widespread adoption of cloud data warehouses was the creation of customer data platforms. Acting as a centralized repository of information about how your customers interact with your organization they drove a wave of analytics about how to improve products based on actual usage data. A natural outgrowth of that capability is the more recent growth of reverse ETL systems that use those analytics to feed back into the operational systems used to engage with the customer. In this episode Tejas Manohar and Rachel Bradley-Haas share the story of their own careers and experiences coinciding with these trends. They also discuss the current state of the market for these technological patterns and how to take advantage of them in your own work.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Go to dataengineeringpodcast.com/montecarlo and start trusting your data with Monte Carlo today!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. 
Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Rachel Bradley-Haas and Tejas Manohar about the combination of operational analytics and the customer data platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan we start by discussing what it means to have a \"customer data platform\"?\nWhat are the challenges that organizations face in establishing a unified view of their customer interactions?\n\nHow do the presence of multiple product lines impact the ability to understand the relationship with the customer?\n\n\nWe have been building data warehouses and business intelligence systems for decades. How does the idea of a CDP differ from the approaches of those previous generations?\nA recent outgrowth of the focus on creating a CDP is the introduction of \"operational analytics\", which was initially termed \"reverse ETL\". What are your opinions on the semantics and importance of these names?\n\nWhat is the relationship between a CDP and operational analytics? (can you have one without the other?)\n\n\nHow have the capabilities of operational analytics systems changed or evolved in the past couple of years?\n\nWhat new use cases or capabilities have been unlocked as a result of these changes?\n\n\nWhat are the opportunities over the medium to long term for operational analytics and customer data platforms?\nWhat are the most interesting, innovative, or unexpected ways that you have seen operational analytics and CDPs used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on operational analytics?\nWhen is a CDP the wrong choice?\nWhat other industry trends are you keeping an eye on? What do you anticipate will be the next breakout product category?\n\nContact Info\n\nRachel\n\nLinkedIn\n\n\nTejas\n\nLinkedIn\n@tejasmanohar on Twitter\ntejasmanohar on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\n\nLinks\n\nBig-Time Data\nHightouch\n\nPodcast Episode\n\n\nSegment\n\nPodcast Episode\n\n\nCustomer Data Platform\nTreasure Data\nRudderstack\nAirflow\nDBT Cloud\nFivetran\n\nPodcast Episode\n\n\nStitch\nPLG == Product Led Growth\nABM == Account Based Marketing\nMaterialize\n\nPodcast Episode\n\n\nTransform\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The precursor to widespread adoption of cloud data warehouses was the creation of customer data platforms. Acting as a centralized repository of information about how your customers interact with your organization they drove a wave of analytics about how to improve products based on actual usage data. A natural outgrowth of that capability is the more recent growth of reverse ETL systems that use those analytics to feed back into the operational systems used to engage with the customer. In this episode Tejas Manohar and Rachel Bradley-Haas share the story of their own careers and experiences coinciding with these trends. They also discuss the current state of the market for these technological patterns and how to take advantage of them in your own work.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Tejas Manohar of Hightouch and Rachel Bradley-Haas of Big-Time Data about how the growth of customer data platforms led to the introduction of reverse ETL systems and how you can use them together to improve your customer experience.","date_published":"2021-11-04T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7bbfee29-96cb-48c1-be1a-fa6950984fa0.mp3","mime_type":"audio/mpeg","size_in_bytes":50469260,"duration_in_seconds":3726}]},{"id":"podlove-2021-10-29t10:29:33+00:00-a9e0ee048aaf6ac","title":"Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator","url":"https://www.dataengineeringpodcast.com/narrator-exploratory-analytics-episode-234","content_text":"Summary\nThe perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the \"activity schema\" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. 
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Narrator is and the story behind it?\nWhat are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?\nWhat are the use cases that you are focused on?\nHow does Narrator fit within the data workflows of an organization?\nHow is the Narrator platform implemented?\n\nHow has the design and focus of the technology evolved since you first started working on Narrator?\n\n\nThe core element of the analyses that you are building is the \"activity schema\". Can you describe the design process that led you to that format?\n\nWhat are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?\n\nHow does the activity schema address those challenges?\n\n\n\n\nWhat are the performance characteristics of deriving models from an activity schema/timeseries table?\nFor someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?\n\nCan you talk through the domain modeling that needs to happen when determining what entities and actions to capture?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Narrator used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?\nWhen is Narrator the wrong choice?\nWhat do you have planned for the future of Narrator?\n\nContact Info\n\nLinkedIn\n@ae4ai on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nNarrator\nDARPA Challenge\nFivetran\nLuigi\nChartio\nAirflow\nDomain Driven Design\nData Vault\nSnowflake Schema\nEvent Sourcing\nCensus\n\nPodcast Episode\n\n\nHightouch\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Ahmed Elsamadisi about how the Narrator platform uses the activity schema to make self service exploratory analytics a seamless experience.","date_published":"2021-10-29T06:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a55918cd-4b20-4a67-92c2-b4a7883504a4.mp3","mime_type":"audio/mpeg","size_in_bytes":55996833,"duration_in_seconds":4128}]},{"id":"podlove-2021-10-29t00:00:22+00:00-805cbdf0fbacc13","title":"Streaming Data Pipelines Made SQL With Decodable","url":"https://www.dataengineeringpodcast.com/decodable-streaming-data-pipelines-sql-episode-233","content_text":"Summary\nStreaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. 
Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nYour host is Tobias Macey and today I’m interviewing Eric Sammer about Decodable, a platform for simplifying the work of building real-time data pipelines\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Decodable is and the story behind it?\nWho are the target users, and how has that focus informed your prioritization of features at launch?\nWhat are the complexities that data engineers encounter when building pipelines on streaming systems?\nWhat are the distributed systems concepts and design optimizations that are often skipped over or misunderstood by engineers who are using them? (e.g. backpressure, exactly once semantics, isolation levels, etc.)\n\nHow do those mismatches in understanding and expectation impact the correctness and reliability of the workflows that they are building?\n\n\nCan you describe how you have architected the Decodable platform?\n\nWhat have been the most complex or time consuming engineering challenges that you have dealt with so far?\n\n\nWhat are the points of integration that you expose for engineers to wire in their existing infrastructure and data systems?\nWhat has been your process for designing the interfaces and abstractions that you are exposing to end users?\n\nWhat are some of the leaks in those abstractions that have either started to show or are anticipated?\n\n\nWhat have you learned about the state of data engineering and the costs and benefits of real-time data while working on Decodable?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Decodable used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?\nWhen is Decodable the wrong choice?\nWhat do you have planned for the future of Decodable?\n\nContact Info\n\nesammer on GitHub\n@esammer on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDecodable\nCloudera\nKafka\nFlink\n\nPodcast Episode\n\n\nSpark\nSnowflake\n\nPodcast Episode\n\n\nBigQuery\nRedShift\nkSQLDB\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nMillwheel Paper\nDremel Paper\nTimely Dataflow\nMaterialize\n\nPodcast Episode\n\n\nSoftware Defined Networking\nData Mesh\n\nPodcast Episode\n\n\nOpenLineage\n\nPodcast Episode\n\n\nDataHub\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Eric Sammer about the difficulty of working with streaming engines at a low level of abstraction and how he and his team at Decodable are working to make development of streaming data pipelines as straightforward as writing SQL","date_published":"2021-10-28T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/90b3d022-ece6-4409-b6f8-e822dd0653a6.mp3","mime_type":"audio/mpeg","size_in_bytes":57679395,"duration_in_seconds":4172}]},{"id":"podlove-2021-10-22t01:00:14+00:00-99d1882d20d0d27","title":"Data Exploration For Business Users Powered By Analytics Engineering With Lightdash","url":"https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232","content_text":"Summary\nThe market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change has been the rise of analytics engineering powered by dbt. Lightdash has fully embraced that shift by building an entire open source business intelligence framework that is powered by dbt models. In this episode Oliver Laslett describes why dashboards aren’t sufficient for business analytics, how Lightdash promotes the work that you are already doing in your data warehouse modeling with dbt, and how they are focusing on bridging the divide between data teams and business teams and the requirements that they have for data workflows.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.\nYour host is Tobias Macey and today I’m interviewing Oliver Laslett about Lightdash, an open source business intelligence system powered by your dbt models\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Lightdash is and the story behind it?\n\nWhat are the main goals of the project?\nWho are the target users, and how has that profile informed your feature priorities?\n\n\nBusiness intelligence is a market that has gone through several generational shifts, with products targeting numerous personas and purposes. What are the capabilities that make Lightdash stand out from the other options?\nCan you describe how Lightdash is architected?\n\nHow have the design and goals of the system changed or evolved since you first began working on it?\nWhat have been the most challenging engineering problems that you have dealt with?\n\n\nHow does the approach that you are taking with Lightdash compare to systems such as Transform and Metriql that aim to provide a dedicated metrics layer?\nCan you describe the workflow for someone building an analysis in Lightdash?\n\nWhat are the points of collaboration around Lightdash for different roles in the organization?\n\n\nWhat are the methods that you use to expose information about the state of the underlying dbt models to the end users?\n\nHow do they use that information in their exploration and decision making?\n\n\nWhat was your motivation for releasing Lightdash as open source?\n\nHow are you handling the governance and long-term viability of the project?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Lightdash used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Lightdash?\nWhen is Lightdash the wrong choice?\nWhat do you have planned for the future of Lightdash?\n\nContact Info\n\nLinkedIn\nowlas on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nLightdash\nLooker\n\nPodcast Episode\n\n\nPowerBI\n\nPodcast Episode\n\n\nRedash\n\nPodcast Episode\n\n\nMetabase\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nSuperset\n\nPodcast Episode\n\n\nStreamlit\n\nPodcast Episode\n\n\nKubernetes\nJDBC\nSQLAlchemy\nSQLPad\nSinger\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\nMeltano\n\nPodcast Episode\n\n\nTransform\n\nPodcast Episode\n\n\nMetriql\n\nPodcast Episode\n\n\nCube.js\nOpenLineage\n\nPodcast Episode\n\n\ndbt Packages\nRudderstack\nPostHog\n\nPodcast Interview\n\n\nFirebolt\n\nPodcast Interview\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Oliver Laslett about the open source Lightdash framework for business intelligence and how it builds on the work that your analytics engineers are doing with dbt.","date_published":"2021-10-22T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a93bb51d-0e12-495c-a1f2-859790356ba3.mp3","mime_type":"audio/mpeg","size_in_bytes":57183993,"duration_in_seconds":3962}]},{"id":"podlove-2021-10-20t01:04:46+00:00-60146d21b501c97","title":"Completing The Feedback Loop Of Data Through Operational Analytics With Census","url":"https://www.dataengineeringpodcast.com/census-operational-analytics-episode-231","content_text":"Summary\nThe focus of the past few years has been to consolidate all of the organization’s data into a cloud data warehouse. As a result there have been a number of trends in data that take advantage of the warehouse as a single focal point. Among those trends is the advent of operational analytics, which completes the cycle of data from collection, through analysis, to driving further action. In this episode Boris Jabes, CEO of Census, explains how the work of synchronizing cleaned and consolidated data about your customers back into the systems that you use to interact with those customers allows for a powerful feedback loop that has been missing in data systems until now. He also discusses how Census makes that synchronization easy to manage, how it fits with the growth of data quality tooling, and how you can start using it today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. 
RSVP today – you don’t want to miss it!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Boris Jabes about Census and the growing category of operational analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Census is and the story behind it?\nThe terms \"reverse ETL\" and \"operational analytics\" have started being used for similar, and often interchangeable, purposes. What are your thoughts on the semantic and concrete differences between these phrases?\nWhat are the motivating factors for adding operational analytics or \"data activation\" to an organization’s data platform?\n\nThis is a nascent but quickly growing market with a number of products and projects operating in the space. How would you characterize the current state of the segment and Census’ position in it?\n\n\nCan you describe how the Census platform is implemented?\n\nWhat are some of the early design choices that have had to be refactored or augmented as you have evolved the product and worked with customers?\nWhat are some of the assumptions that you had about the needs and uses for the platform which have been challenged or changed as you dug deeper into the problem?\n\n\nCan you describe the workflow for a customer adopting Census?\n\nWhat are some of the data modeling practices that make it easier to \"activate\" the organization’s data?\n\n\nAnother recent trend in the data industry is the growth of data quality and data lineage tools. What is involved in using the measured quality or lineage information as a signal in the operational systems, or to prevent a synchronization?\nHow can users test and validate their workflows in Census?\n\nWhat are the options for propagating Census’ runtime information back into lineage and data quality tracking?\n\n\nCensus supports incremental syncs from the warehouse. What are the opportunities for bringing streaming architectures to the space of operational analytics?\n\nWhat are the challenges/complexities in the current set of technologies that act as a barrier?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Census used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Census?\nWhen is Census the wrong choice?\nWhat do you have planned for the future of Census?\n\nContact Info\n\nLinkedIn\nWebsite\n@borisjabes on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCensus\nOperational Analytics\nFivetran\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nLoom\nMaterialize\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Boris Jabes of Census about the growing trend of operational analytics, how it allows data teams to complete the feedback loop for data value, and how the Census platform is architected to make it easy to implement.","date_published":"2021-10-20T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e05c8e8c-be4f-41da-bc32-1f9ac75eba48.mp3","mime_type":"audio/mpeg","size_in_bytes":59335200,"duration_in_seconds":4146}]},{"id":"podlove-2021-10-16t01:00:29+00:00-447a0c7359f90b3","title":"Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data","url":"https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230","content_text":"Summary\nThe binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to be able to work for small scale systems that are just starting to develop in complexity. In order to support the project and make it even easier to use for organizations of every size Shirshanka Das and Swaroop Jagadish founded Acryl Data. In this episode they discuss the recent work that has been done by the community, how their work is building on top of that foundation, and how you can get started with DataHub for your own work to manage data discovery today. They also share their ambitions for the near future of adding data observability and data quality management features.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. 
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nYour host is Tobias Macey and today I’m interviewing Shirshanka Das and Swaroop Jagadish about Acryl Data, the company driving the open source metadata project DataHub for powering data discovery, data observability and federated data governance.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Acryl Data is and the story behind it?\nHow has your experience of building and running DataHub at LinkedIn informed your product direction for Acryl?\n\nWhat are some lessons that your co-founder Swaroop has contributed from his experience at AirBnB?\n\n\nThe data catalog/discovery/quality market has become very active over the past year. What is your perspective on the market, and what are the gaps that are not yet being addressed?\n\nHow does the focus of Acryl compare to what the team at Metaphor are building?\n\n\nHow has the DataHub project changed in the past year with more companies outside of LinkedIn using and contributing to it?\nWhat are your plans for Data Observability?\nCan you describe the system architecture that you have built at Acryl?\nWhat are the convenience features that you are building to augment the capabilities and integration process for DataHub?\nWhat are some typical workflows that data teams build out when working with Acryl?\nWhat are some examples of automated actions that can be triggered from metadata changes?\n\nWhat are the available events that can be used to trigger actions?\n\n\nWhat are some of the challenges that teams are facing when integrating metadata management and analysis into their data workflows?\nWhat are your thoughts on the potential for the Open Lineage and Open metadata projects?\nHow is the governance of DataHub being managed?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Acryl/DataHub used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Acryl/DataHub?\nWhen is Acryl the wrong choice?\nWhat do you have planned for the future of Acryl?\n\nContact Info\n\nShirshanka\n\nLinkedIn\n@shirshanka on Twitter\nshirshanka on GitHub\n\n\nSwaroop\n\nLinkedIn\n@arudis on Twitter\nswaroopjagadish on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAcryl Data\nDataHub\nHudi\n\nPodcast Episode\n\n\nIceberg\n\nPodcast Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nApache Gobblin\nAirflow\nSuperset\n\nPodcast Episode\n\n\nCollibra\n\nPodcast Episode\n\n\nAlation\nStrata Conference Presentation\nAcryl/DataHub Ingestion Framework\nJoe Hellerstein\nTrifacta\nDataHub Roadmap\nData Mesh\nOpenLineage\n\nPodcast Episode\n\n\nOpenMetadata\nEgeria Open Metadata\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with the founders of Acryl Data about their work to bring DataHub to every organization for more powerful data discovery, data quality management, and data observability.","date_published":"2021-10-15T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fac963a8-ecc0-46db-9ddf-1aa9dd72e022.mp3","mime_type":"audio/mpeg","size_in_bytes":52645888,"duration_in_seconds":4098}]},{"id":"podlove-2021-10-14t00:12:51+00:00-ca3f22e1dcce879","title":"How And Why To Become Data Driven As A Business","url":"https://www.dataengineeringpodcast.com/fail-fast-learn-faster-data-driven-business-episode-229","content_text":"Summary\nOrganizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data. In this episode he discusses his experiences and how he approached the work of distilling them for his book \"Fail Fast, Learn Faster\". This is an entertaining and enlightening exploration of the business side of data with an industry veteran.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. 
By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Randy Bean about his recent book focusing on the use of big data and AI for informing data driven business leadership\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by discussing the focus of the book and what motivated you to write it?\n\nWho is the intended audience, and how did that inform the tone and content?\n\n\nBusinesses and their officers have been aiming to be \"data driven\" for years. In your experience, what are the concrete goals that are implied by that term?\n\nWhat are the barriers that organizations encounter in the pursuit of those goals?\nHow have the success rates (real and imagined) shifted in recent years as the level of sophistication of the tools and industry for data management has increased?\n\n\nWhat is the state of data initiatives in leading corporations today?\nWhat are the biggest opportunities and risks that organizations focus on related to their use of data?\nAt what level(s) of the organization do lessons around data ethics need to be embedded?\nYou have been working with large companies for many years to help them with their adoption of \"big data\". How has your work on this book shifted or clarified your perspectives on the subject?\nWhat are the main lessons or ideas that you hope readers will take away from the book?\nWhat are the most interesting, innovative, or unexpected ways that you have seen big data applied to business?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on this book?\nWhat are your predictions for the next decade of big data and AI?\n\nContact Info\n\n@RandyBeanNVP on Twitter\nLinkedIn\nEmail\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nFail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI (affiliate link)\n\nBook Website\n\n\nHarvard Business Review\nMIT Sloan Review\nNew Vantage Partners\nCOBOL\nMoneyball\nWeapons of Math Destruction\nThe Seven Roles of the Chief Data Officer\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Randy Bean of New Venture Partners about his recent book "Fail Fast, Learn Faster", and why it is more important than ever for businesses to be data driven at every level.","date_published":"2021-10-13T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ebba9591-a568-4aac-ae0d-4fe1cac8c801.mp3","mime_type":"audio/mpeg","size_in_bytes":50046972,"duration_in_seconds":3719}]},{"id":"podlove-2021-10-08t11:42:45+00:00-74971cfbc4715c7","title":"Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql","url":"https://www.dataengineeringpodcast.com/metriql-open-source-headless-bi-episode-228","content_text":"Summary\nThe key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nYour host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Metriql is and the story behind it?\nWhat are the characteristics and benefits of a \"headless BI\" system?\nWhat was your motivation to create and open-source Metriql as an independent project outside of your business?\n\nHow are you approaching governance and sustainability of the project?\n\n\nHow does Metriql compare to projects such as AirBnB’s Minerva or Transform’s platform?\nHow does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?\n\nWhat are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?\n\n\nCan you describe how Metriql is implemented?\n\nHow have the design and goals of the project changed or evolved since you began working on it?\nWhat are the most complex/difficult engineering elements of building a metrics layer?\n\n\nCan you describe the workflow of defining metrics?\n\nWhat have been your guiding principles in defining the user experience for working with metriql?\nWhat are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)\n\n\nWhat are the biggest challenges and limitations of creating metrics definitions purely in SQL?\nWhat are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?\nWhat are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Metriql used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?\nWhen is Metriql the wrong choice?\nWhat do you have planned for the future of Metriql?\n\nContact Info\n\nLinkedIn\nWebsite\nburemba on GitHub\n@bu7emba on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nMetriql\nRakam\nHazelcast\nHeadless BI\nGoogle Data Studio\nSuperset\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nTrino\n\nPodcast Episode\n\n\nSupergrain\nThe Missing Piece Of The Modern Data Stack article by Benn Stancil\nMetabase\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\ndbt-metabase\nre_data\nOpenMetadata\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Burak Kabakcı about the open source headless BI system Metriql and how it provides a central system for defining and using key business metrics.","date_published":"2021-10-08T11:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4f02ad0f-c993-4664-a874-dc5c07a3b491.mp3","mime_type":"audio/mpeg","size_in_bytes":33982582,"duration_in_seconds":2617}]},{"id":"podlove-2021-10-06t00:02:22+00:00-d853a38f279846e","title":"Adding Support For Distributed Transactions To The Redpanda Streaming Engine","url":"https://www.dataengineeringpodcast.com/redpanda-distributed-transactions-episode-227","content_text":"Summary\nTransactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. 
RSVP today – you don’t want to miss it!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the RedPanda streaming engine\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you quickly recap what RedPanda is and the goals of the project?\nWhat are the use cases for transactions in a pub/sub messaging system?\n\nWhat are the elements of streaming systems that make atomic transactions a complex problem?\n\n\nWhat was the motivation for starting down the path of adding transactions to the RedPanda engine?\n\nHow did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?\n\n\nCan you talk through the details of how you ended up implementing transactions in RedPanda?\n\nWhat are some of the roadblocks and complexities that you encountered while working through the implementation?\n\n\nHow did you approach the validation and verification of the transactions?\nWhat other features or capabilities are you planning to work on next?\nWhat are the most interesting, innovative, or unexpected ways that you have seen transactions in RedPanda used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for RedPanda?\nWhen are transactions the wrong choice?\nWhat do you have planned for the future of transaction support in RedPanda?\n\nContact Info\n\n@rystsov on twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nVectorized\nRedPanda\n\nPodcast Episode\n\n\nRedPanda Transactions Post\nYandex\nCassandra\nMongoDB\nRiak\nCosmos DB\nJepsen\n\nPodcast Episode\n\n\nTesting Shared Memories paper\nJournal of Systems Research\nKafka\nPulsar\nSeastar Framework\nCockroachDB\n\nPodcast Episode\n\n\nTiDB\nCalvin Paper\nPolyjuice Paper\nParallel Commit\nChaos Testing\nMatchmaker Paxos Algorithm\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview with Denis Rystsov about how he designed and implemented support for distributed transactions in the Redpanda streaming engine.","date_published":"2021-10-05T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c1cb4fe6-3bcd-4f6a-adbb-9548a28aa955.mp3","mime_type":"audio/mpeg","size_in_bytes":42787452,"duration_in_seconds":2758}]},{"id":"podlove-2021-10-02t19:40:14+00:00-bc9a12c6cd65c64","title":"Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike","url":"https://www.dataengineeringpodcast.com/aerospike-real-time-data-platform-episode-226","content_text":"Summary\nAerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the ability to process these large volumes of information in real-time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nYour host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Aerospike is and the story behind it?\n\nWhat are the use cases that it is uniquely well suited for?\nWhat are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?\n\n\nWhat are the driving factors for building a real-time data platform?\nHow is Aerospike being incorporated in application and data architectures?\nCan you describe how the Aerospike engine is architected?\n\nHow have the design and architecture changed or evolved since it was first created?\nHow have market forces influenced the product priorities and focus?\n\n\nWhat are the challenges that end users face when determining how to model their data given a key/value storage interface?\n\nWhat are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?\n\n\nWhat are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)\nWhat are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?\nWhen is Aerospike the wrong choice?\nWhat do you have planned for the future of Aerospike?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAerospike\n\nGitHub\n\n\nEnterpriseDB\n\"Nobody Expects The Spanish Inquisition\"\nARM CPU Architectures\nAWS Graviton Processors\nThe Datacenter Is The Computer (Affiliate link)\nJepsen Tests\n\nPodcast Episode\n\n\nCloud Native Computing Foundation\nPrometheus\nGrafana\nOpenTelemetry\n\nPodcast.__init__ Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"


","summary":"An interview about how the Aerospike database engine provides a foundation for building real-time data platforms that work at terabyte to petabyte scale.","date_published":"2021-10-02T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d426996f-edad-4fc4-957d-e8eb2793a13b.mp3","mime_type":"audio/mpeg","size_in_bytes":43985344,"duration_in_seconds":4058}]},{"id":"podlove-2021-09-29t00:42:02+00:00-b666b18f76256cb","title":"Delivering Your Personal Data Cloud With Prifina","url":"https://www.dataengineeringpodcast.com/prifina-personal-data-cloud-episode-225","content_text":"Summary\nThe promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina have built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect user’s privacy while still providing compelling experiences.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. 
Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!\nYour host is Tobias Macey and today I’m interviewing Markus Lampinen about Prifina, a platform for building applications powered by personal data that is under the user’s control\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Prifina is and the story behind it?\n\nWhat are the primary goals of Prifina?\n\n\nThere has been a lot of interest in the \"quantified self\" and different projects (many that are open source) which aim to aggregate all of a user’s data into a single system for analysis and integration. What was lacking in the ecosystem that makes Prifina necessary/valuable?\nWhat are some of the personalized applications for this data that have been most compelling or that users are most interested in?\nWhat are the sources of complexity that you are facing when managing access/privacy of user’s data?\nCan you describe the architecture of the platform that you are building?\n\nWhat are the technological/social/economic underpinnings that are necessary to make a platform like Prifina possible?\nWhat are the assumptions that you had when you first became involved in the project which have been challenged or invalidated as you worked through the implementation and began engaging with users and developers?\n\n\nHow do you approach schema definition/management for developers to have a stable implementation target?\n\nHow has that schema evolved as you introduced new data sources?\n\n\nWhat are the barriers that you and your users have to deal with when obtaining copies of their data for use with Prifina?\nWhat are the potential threats that you anticipate for users gaining and maintaining control of their own data?\n\nWhat are the untapped opportunities?\n\n\nWhat are the topics where you have had to invest the most in user education?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Prifina used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Prifina?\nWhen is Prifina the wrong choice?\nWhat do you have planned for the future of Prifina?\n\nContact Info\n\nLinkedIn\n@mmlampinen on Twitter\nmmlampinen on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nPrifina\nGoogle Takeout\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"A conversation about how the team at Prifina are building a platform that puts users in control of their own data and lets developers build easy to use experiences that are powered by that rich information.","date_published":"2021-09-29T20:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2d40b8b5-dca4-4897-8956-09be571342c5.mp3","mime_type":"audio/mpeg","size_in_bytes":56851531,"duration_in_seconds":4331}]},{"id":"podlove-2021-09-26t01:06:32+00:00-88b3b899a526002","title":"Digging Into Data Reliability Engineering","url":"https://www.dataengineeringpodcast.com/data-reliability-engineering-episode-224","content_text":"Summary\nThe accuracy and availability of data has become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nSchema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. 
Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.\nYour host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat does the term \"Data Reliability Engineering\" mean?\nWhat is encompassed under the umbrella of Data Reliability Engineering?\n\nHow does it compare to the concepts from site reliability engineering?\nIs DRE just a repackaged version of DataOps?\n\n\nWhy is Data Reliability Engineering particularly important now?\nWho is responsible for the practice of DRE in an organization?\nWhat are some areas of innovation that teams are focusing on to support a DRE practice?\nWhat are the tools that teams are using to improve the reliability of their data operations?\nWhat are the organizational systems that need to be in place to support a DRE practice?\n\nWhat are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?\n\n\nWhat are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?\nIs Data Reliability Engineering ever the wrong choice?\nWhat do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?\n\nContact Info\n\nFind us at bigeye.com or reach out to us at hello@bigeye.com\nYou can find Egor on LinkedIn or email him at egor@bigeye.com\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nBigeye\n\nPodcast Episode\n\n\nVertica\nLooker\n\nPodcast Episode\n\n\nSite Reliability Engineering\nStemma\n\nPodcast Episode\n\n\nCollibra\n\nPodcast Episode\n\n\nOpenLineage\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"A conversation about the parallels between data reliability engineering and site reliability engineering, how they differ, and steps that you can take to start adopting data reliability engineering practices in your own teams.","date_published":"2021-09-25T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/535637d8-23c5-40d3-abcd-7cd3d562b8a7.mp3","mime_type":"audio/mpeg","size_in_bytes":49800840,"duration_in_seconds":3487}]},{"id":"podlove-2021-09-25t00:21:37+00:00-bb0658550b71a58","title":"Massively Parallel Data Processing In Python Without The Effort Using Bodo","url":"https://www.dataengineeringpodcast.com/bodo-parallel-data-processing-python-episode-223","content_text":"Summary\nPython has beome the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information.There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! 
Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!\nYour host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing python code for massively parallel data processing and analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Bodo is and the story behind it?\nWhat are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?\nWhy have you focused your efforts on the Python language and toolchain?\n\nDo you see any potential for expanding into other language communities?\nWhat are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?\n\n\nMany people are familiar with the principle of HPC architectures, but can you share an overview of the current state of the art for HPC?\n\nWhat are the tradeoffs of HPC vs scale-out distributed systems?\n\n\nCan you describe the technical implementation of the Bodo platform?\n\nWhat are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?\n\nHow do you handle compiled extensions? (e.g. C/C++/Fortran)\n\n\nWhat are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?\n\n\nHow do you handle data distribution for scale out computation?\nWhat are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?\nWhat are some of the educational challenges that you have run into while working with potential and current customers?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Bodo used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?\nWhen is Bodo the wrong choice?\nWhat do you have planned for the future of Bodo?\n\nContact Info\n\nLinkedIn\n@EhsanTn on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nBodo\nHigh Performance Computing (HPC)\nUniversity of Illinois, Urbana-Champaign\nJulia Language\nPandas\n\nPodcast.__init__ Episode\n\n\nNumPy\nDask\n\nPodcast Episode\n\n\nRay\n\nPodcast.__init__ Episode\n\n\nNumba\nLLVM\nSPMD\nMPI\nElastic Fabric Adapter\nIceberg Table Format\n\nPodcast Episode\n\n\nIPython Parallel\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"An interview about how Bodo converts standard Python code to native MPI automatically for massive speed ups in data processing workloads","date_published":"2021-09-24T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ff794ce2-4753-40d3-bc86-508149813964.mp3","mime_type":"audio/mpeg","size_in_bytes":47163121,"duration_in_seconds":3856}]},{"id":"podlove-2021-09-19t21:06:45+00:00-e7096b9fdec4432","title":"Declarative Machine Learning Without The Operational Overhead Using Continual","url":"https://www.dataengineeringpodcast.com/continual-declarative-machine-learning-episode-222","content_text":"Summary\nBuilding, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nSchema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. 
By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Continual is and the story behind it?\n\nWhat is your definition for \"operational AI\" and how does it differ from other applications of ML/AI?\n\n\nWhat are some example use cases for AI in an operational capacity?\n\nWhat are the barriers to adoption for organizations that want to take advantage of predictive analytics?\n\n\nWho are the target users of Continual?\nCan you describe how the Continual platform is implemented?\n\nHow has the design and infrastructure changed or evolved since you first began working on it?\n\n\nWhat is the workflow for someone building a model and putting it into production?\n\nOnce a model has been deployed, what are the mechanisms that you expose for interacting with it?\n\n\nHow does this differ from in-database ML capabilities such as what is offered by Vertica and BigQuery?\nHow much understanding of ML/AI principles is necessary for someone to create a model with Continual?\nWhat is your estimation of the impact that Continual can have on the overall productivity of a data team/data scientist?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Continual used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Continual?\nWhen is Continual the wrong choice?\nWhat do you have planned for the future of Continual?\n\nContact Info\n\nLinkedIn\n@tristanzajonc on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nContinual\nWorld Bank\nSAS\nSPSS\nStata\nFeature Store\nDataRobot\nTransfer Learning\ndbt\n\nPodcast Episode\n\n\nLudwig\nOverton (Apple)\nHightouch\nCensus\nGalaxy Schema\nIn-Database ML Podcast Episode\nscikit-learn\nSnorkel\n\nPodcast Episode\n\n\nMaterialize\n\nPodcast Episode\n\n\nFlink SQL\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"An interview with Tristan Zajonc about his work at Continual to make declarative machine learning workflows possible and seamless by building on top of the data warehouse, and how it reduces the time and cost of putting machine learning into production.","date_published":"2021-09-19T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2f7f42ff-e251-46b1-a7e6-50c674d38a5a.mp3","mime_type":"audio/mpeg","size_in_bytes":53725513,"duration_in_seconds":4311}]},{"id":"podlove-2021-09-19t20:35:03+00:00-9e402c7af192cbd","title":"An Exploration Of The Data Engineering Requirements For Bioinformatics","url":"https://www.dataengineeringpodcast.com/bioinformatics-data-engineering-episode-221","content_text":"Summary\nBiology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and compuational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nStruggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. 
Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!\nYour host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nHow did you get into the field of bioinformatics?\nCan you describe what is unique about data needs in bioinformatics?\nWhat are some of the problems that you have found yourself regularly solving for your clients?\nWhen building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)\nCan you describe a typical set of technologies that you implement when working on a new project?\n\nWhat kinds of systems do you need to integrate with?\n\n\nWhat are the data formats that are widely used for bioinformatics?\n\nWhat are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?\n\n\nWhat amount of domain expertise is necessary for a data engineer to work in life sciences?\nWhat are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?\nWhat are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?\n\nContact Info\n\nLinkedIn\njerowe on GitHub\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nBioinformatics\nHow Perl Saved The Human Genome Project\nNeo4J\nAWS Parallel Cluster\nDatashader\nR Shiny\nPlotly Dash\nApache Parquet\nDask\n\nPodcast Episode\n\n\nHDF5\nSpark\nSuperset\n\nData Engineering Podcast Episode\nPodcast.__init__ Episode\n\n\nFastQ file format\nBAM (Binary Alignment Map) File\nVariant Call Format (VCF)\nHIPAA\nDVC\n\nPodcast Episode\n\n\nLakeFS\nBioThings API\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"An interview with Jillian Rowe about the data engineering and data infrastructure needs that exist in the field of bioinformatics.","date_published":"2021-09-19T16:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4f74f84a-1b7a-4790-b342-a326589e688b.mp3","mime_type":"audio/mpeg","size_in_bytes":40378235,"duration_in_seconds":3309}]},{"id":"podlove-2021-09-12t01:05:36+00:00-0ec5cf386e241af","title":"Setting The Stage For The Next Chapter Of The Cassandra Database","url":"https://www.dataengineeringpodcast.com/cassandra-global-scale-database-episode-220","content_text":"Summary\nThe Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. In this episode Ben Bromhead, CTO of Instaclustr, shares the challenges that the community has worked through, the work that went into the release, and how the stability and testing improvements are setting the stage for the future of the project.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nSchema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Ben Bromhead about the recent release of Cassandra version 4 and how it fits in the current landscape of data tools\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nFor anyone who isn’t familiar with Cassandra, can you briefly describe what it is and some of the story behind it?\n\nHow did you get involved in the Cassandra project and how would you characterize your role?\n\n\nWhat are the main use cases and industries where someone is likely to use Cassandra?\nWhat is notable about the version 4 release?\n\nWhat were some of the factors that contributed to the long delay between versions 3 and 4? (2015 – 2021)\nWhat are your thoughts on the ongoing utility/benefits of projects such as ScyllaDB, particularly in light of the most recent release?\n\n\nCassandra is primarily used as a system of record. What are some of the tools and system architectures that users turn to when building analytical workloads for data stored in Cassandra?\nThe architecture of Cassandra has lent itself well to the cloud native ecosystem that has been growing in recent years. What do you see as the opportunities for Cassandra over the near to medium term as the cloud continues to grow in prominence?\nWhat are some of the challenges that you and the Cassandra community have faced with the flurry of new data storage and processing systems that have popped up over the past few years?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Cassandra used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Cassandra?\nWhen is Cassandra the wrong choice?\nWhat is in store for the future of Cassandra?\n\nContact Info\n\nLinkedIn\n@benbromhead on Twitter\nbenbromhead on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCassandra\nInstaclustr\nHBase\nDynamoDB Whitepaper\nProperty Based Testing\nQuickTheories\nRiak\nFoundationDB\n\nPodcast Episode\n\n\nScyllaDB\n\nPodcast Episode\n\n\nYugabyteDB\n\nPodcast Episode\n\n\nAzure CosmoDB\nAmazon Keyspaces\nNetty\nKafka\nCQRS == Command Query Responsibility Segregation\nElasticsearch\nRedis\nMemcached\nDebezium\n\nPodcast Episode\n\n\nCDC == Change Data Capture\n\nPodcast Episodes\n\n\nBigtable White Paper\nCockroachDB\n\nPodcast Episode\n\n\nVitess\nCAP Theorem\nPaxos\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"An interview with Ben Bromhead about the technical and community efforts that went into the latest release of the Cassandra database and the foundation that it has laid for the future of the project.","date_published":"2021-09-12T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/707c508e-aace-4de6-b821-bee6e085ee2e.mp3","mime_type":"audio/mpeg","size_in_bytes":46499607,"duration_in_seconds":3568}]},{"id":"podlove-2021-09-09t02:14:22+00:00-2fbb606b7209279","title":"A View From The Round Table Of Gartner's Cool Vendors","url":"https://www.dataengineeringpodcast.com/gartner-cool-vendors-in-data-2021-episode-219","content_text":"Summary\nGartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses that are working in the data management and analytics space they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nHave you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. 
Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.\nYour host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as \"cool vendors\" by Gartner.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you each describe what you view as the biggest challenge facing data professionals?\nWho are you building your solutions for and what are the most common data management problems are you all solving?\nWhat are different components of Data Management and why is it so complex?\nWhat will simplify this process, if any?\nThe report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?\nHow has the data management space changed in recent times? Describe the current data management landscape and any key developments.\nFrom your perspective, what are the biggest challenges in the data management space today? What modern data management features are lacking in existing databases?\nGartner imagines a future where data and analytics leaders need to be prepared to rely on data management solutions that make heterogeneous, distributed data appear consolidated, easy to access and business friendly. How does this tally with your vision of the future of data management and what needs to happen to make this a reality?\nWhat are the most interesting, innovative, or unexpected ways that you have seen your respective products used (in isolation or combined)?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on your respective platforms?\nWhat are the upcoming trends and challenges that you are keeping a close eye on?\n\nContact Info\n\nSaket\n\nLinkedIn\n@saketsaurabh on Twitter\n\n\nMaarten\n\nLinkedIn\n@masscheleinm on Twitter\n\n\nDan\n\nLinkedIn\n\n\nAkshay\n\nWebsite\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nNexla\nSoda\nTada\nTimbr\nCollibra\n\nPodcast Episode\n\n\nGartner Cool Vendors\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"An interview with the leaders of the companies identified as Gartner's "Cool Vendors" in data for 2021 about the challenges faced by companies and data professionals and the work that they are doing to address those difficulties.","date_published":"2021-09-08T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1b0e60c9-2ebf-4b2d-92a5-0ae9c412601d.mp3","mime_type":"audio/mpeg","size_in_bytes":48218482,"duration_in_seconds":3856}]},{"id":"podlove-2021-09-04t00:52:22+00:00-3359c407c533af2","title":"Designing And Building Data Platforms As A Product","url":"https://www.dataengineeringpodcast.com/data-platform-as-a-product-discussion-panel-episode-218","content_text":"Summary\nThe term \"data platform\" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. This is a valuable conversation about how to approach the work of selecting the tools that you use to power your data systems and considerations for how they can be woven together for a unified experience across your various stakeholders.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Lior Gavish, Lior Solomon, and Atul Gupte about the technical, social, and architectural aspects of building your data platform as a product for your internal customers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management? – all\nCan we start by establishing a definition of \"data platform\" for the purpose of this conversation?\nWho are the stakeholders in a data platform?\n\nWhere does the responsibility lie for creating and maintaining (\"owning\") the platform?\n\n\nWhat are some of the technical and organizational constraints that are likely to factor into the design and execution of the platform?\nWhat are the minimum set of requirements necessary to qualify as a platform? (as opposed to a collection of discrete components)\n\nWhat are the additional capabilities that should be in place to simplify the use and maintenance of the platform?\n\n\nHow are data platforms managed? Are they managed by technical teams, product managers, etc.? What is the profile for a data product manager? – Atul G.\nHow do you set SLIs / SLOs with your data platform team when you don’t have clear metrics you’re tracking? – Lior S.\nThere has been a lot of conversation recently about different interpretations of the \"modern data stack\". For a team who is just starting to build out their platform, how much credence should they be giving to those debates?\n\nWhat are the first steps that you recommend for those practitioners?\nIf an organization already has infrastructure in place for data/analytics, how might they think about building or buying their way toward a well integrated platform?\n\n\nOnce a platform is established, what are some challenges that teams should anticipate in scaling the platform?\n\nWhich axes of scale have you found to be most difficult to manage? (scale of infrastructure capacity, scale of organizational/technical complexity, scale of usage, etc.)\nDo we think the \"data platform\" is a skill set? How do we split up the role of the platform? Is there one for real-time? Is there one for ETLs?\nHow do you handle the quality and reliability of the data powering your solution?\n\n\nWhat are helpful techniques that you have used for collecting, prioritizing, and managing feature requests?\n\nHow do you justify the budget and resources for your data platform?\nHow do you measure the success of a data platform?\n\n\nWhat is the relationship between a data platform and data products?\nAre there any other companies you admire when it comes to building robust, scalable data architecture?\nWhat are the most interesting, innovative, or unexpected ways that you have seen data platforms used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building and operating a data platform?\nWhen is a data platform the wrong choice? 
(as opposed to buying an integrated solution, etc.)\nWhat are the industry trends that you are monitoring/excited for in the space of data platforms?\n\nContact Info\n\nLior Gavish\n\nLinkedIn\n@lgavish on Twitter\n\n\nLior Solomon\n\nLinkedIn\n@liorsolomon on Twitter\n\n\nAtul Gupte\n\nLinkedIn\n@atulgupte on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nMonte Carlo\nVimeo\nFacebook\nUber\nZynga\nGreat Expectations\n\nPodcast Episode\n\n\nAirflow\n\nPodcast.__init__ Episode\n\n\nFivetran\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nLooker\n\nPodcast Episode\n\n\nModern Data Stack Podcast Episode\nStitch\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"A panel discussion about what constitutes a data platform, how to think about designing one from scratch, and ways that you can evolve your existing data infrastructure into a cohesive experience for all of your stakeholders.","date_published":"2021-09-03T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5ba14e60-5942-4759-a996-5b33103fe8d6.mp3","mime_type":"audio/mpeg","size_in_bytes":47524233,"duration_in_seconds":3600}]},{"id":"podlove-2021-09-01t00:50:53+00:00-45e6b139812c379","title":"Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana","url":"https://www.dataengineeringpodcast.com/ahana-presto-cloud-data-lake-episode-217","content_text":"Summary\nThe Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify the work of running your own PrestoDB environment in the cloud. She explains how they are optimizin the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nSchema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. 
Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Dipti Borkar, cofounder Ahana about Presto and Ahana, SaaS managed service for Presto\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Ahana is and the story behind it?\nThere has been a lot of recent activity in the Presto community. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data?\n\nWhat is Ahana’s role in the community/ecosystem?\n(happy to skip this question if it’s too contentious) What are some of the notable differences that have emerged over the past couple of years between the Trino (formerly PrestoSQL) and PrestoDB projects?\n\n\nAnother area that has been seeing a lot of activity is data lakes and projects to make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.). How has that influenced your product focus and capabilities?\n\nHow does this activity change the calculus for organizations who are deciding on a lake or warehouse for their data architecture?\n\n\nCan you describe how the Ahana Cloud platform is architected?\n\nWhat are the additional systems that you have built to manage deployment, scaling, and multi-tenancy?\n\n\nBeyond the storage and processing, what are the other notable tools and projects that have become part of the overall stack for supporting open analytics?\nWhat are some areas of ongoing activity that you are keeping an eye on as you build out the Ahana offerings?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Ahana/Presto used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Ahana?\nWhen is Ahana the wrong choice?\nWhat do you have planned for the future of Ahana?\n\nContact Info\n\nLinkedIn\n@dborkar on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAhana\nAlluxio\n\nPodcast Episode\n\n\nCouchbase\nKinetica\nTensorflow\nPyTorch\n\nPodcast.__init__ Episode\n\n\nAWS Athena\nAWS Glue\nHive Metastore\nClickhouse\nDremio\n\nPodcast Episode\n\n\nApache Drill\nTeradata\nSnowflake\n\nPodcast Episode\n\n\nBigQuery\nRaptorX\nAria Optimizations for Presto\nApache Ranger\n\nPresto Plugin\n\n\nTrino\n\nPodcast Episode\n\n\nStarburst\n\nPodcast Episode\n\n\nHive\nIceberg\n\nPodcast Episode\n\n\nHudi\n\nPodcast Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nSuperset\n\nPodcast.__init__ Episode\nData Engineering Podcast Episode\n\n\nNessie\nLakeFS\nAmundsen\n\nPodcast Episode\n\n\nDataHub\n\nPodcast Episode\n\n\nOtterTune\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify running your own PrestoDB environment in the cloud. She explains how they are optimizing the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.
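
For context on what interactive SQL on the lake looks like from the analyst's side, here is a minimal sketch (not from the episode) of querying a PrestoDB coordinator from Python with the presto-python-client package; the host, catalog, and table names are illustrative assumptions rather than anything specific to Ahana's managed service.

# Minimal sketch: run an interactive query against a PrestoDB coordinator.
# Host, user, catalog, and table names below are illustrative assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical coordinator endpoint
    port=8080,
    user="analyst",
    catalog="hive",      # lake tables registered in a Hive metastore
    schema="analytics",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT order_date, count(*) AS orders
    FROM orders
    GROUP BY order_date
    ORDER BY order_date DESC
    LIMIT 10
    """
)
for order_date, orders in cur.fetchall():
    print(order_date, orders)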

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Dipti Borkar about how she and her team at Ahana are cutting out the complexity so that you can get your cloud data lake up and running in no time with Presto powering your low latency SQL analytics.","date_published":"2021-09-01T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2ad9d006-65e4-41fc-933c-1670766b4ba9.mp3","mime_type":"audio/mpeg","size_in_bytes":50140691,"duration_in_seconds":3630}]},{"id":"podlove-2021-08-28t00:30:02+00:00-3823e719b1c19fb","title":"Do Away With Data Integration Through A Dataware Architecture With Cinchy","url":"https://www.dataengineeringpodcast.com/cinchy-dataware-platform-episode-216","content_text":"Summary\nThe reason that so much time and energy is spent on data integration is because of how our applications are designed. By making the software be the owner of the data that it generates, we have to go through the trouble of extracting the information to then be used elsewhere. The team at Cinchy are working to bring about a new paradigm of software architecture that puts the data as the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a \"Dataware\" platform eliminates the need for costly and error prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nHave you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. 
Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.\nYour host is Tobias Macey and today I’m interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL/ELT\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Cinchy is and the story behind it?\nIn your experience working in data and building complex enterprise-grade systems, what are the shortcomings and negative externalities of an ETL/ELT approach to data integration?\nHow is a Dataware platform from a data lake or data warehouses? What is it used for?\nWhat is Zero-Copy Integration? How does that work?\nCan you describe how customers start their Cinchy journey?\nWhat are the main use case patterns that you’re seeing with Dataware?\nYour platform offers unlimited users, including business users. What are some of the challenges that you face in building a user experience that doesn’t become overwhelming as an organization scales the number of data sources and processing flows?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Cinchy used?\nWhen is Cinchy the wrong choice for a customer?\nCan you describe the technical architecture of the Cinchy platform?\nHow do you establish connections/relationships among data from disparate sources?\nHow do you manage schema evolution in source systems?\nWhat are some of the edge cases that users need to consider as they are designing and building those connections?\nWhat are some of the features or capabilities of Cinchy that you think are overlooked or under-utilized?\nHow has your understanding of the problem space changed since you started working on Cinchy?\nHow has the architecture and design of the system evolved to reflect that updated understanding?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Cinchy?\nWhat do you have planned for the future of Cinchy?\n\nContact Info\n\nLinkedIn\n@dandemers on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nCinchy\nGordon Everest\nData Collaboration Alliance\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

So much time and energy is spent on data integration because of how our applications are designed. By making the software the owner of the data that it generates, we have to go through the trouble of extracting the information so that it can be used elsewhere. The team at Cinchy are working to bring about a new paradigm of software architecture that makes the data the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error-prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Cinchy CEO Dan DeMers about the benefits of building your systems with a dataware architecture to eliminate the need for ongoing data integration","date_published":"2021-08-27T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5b2243b2-2a56-4803-9958-07b12103ee85.mp3","mime_type":"audio/mpeg","size_in_bytes":45972786,"duration_in_seconds":3086}]},{"id":"podlove-2021-08-24t00:20:51+00:00-fc15347309ba7d8","title":"Decoupling Data Operations From Data Infrastructure Using Nexla","url":"https://www.dataengineeringpodcast.com/nexla-data-operations-platform-episode-215","content_text":"Summary\nThe technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nSchema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? 
But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across businesses\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Nexla is and the story behind it?\nWhat are the major problems that Nexla is aiming to solve?\n\nWhat are the components of a data platform that Nexla might replace?\n\n\nWhat are the use cases and benefits of being able to publish data sets for use outside and across organizations?\nWhat are the different elements involved in implementing DataOps?\nHow is the Nexla platform implemented?\n\nWhat have been the most comple engineering challenges?\nHow has the architecture changed or evolved since you first began working on it?\nWhat are some of the assumptions that you had at the start which have been challenged or invalidated?\n\n\nWhat are some of the heuristics that you have found most useful in generating logical units of data in an automated fashion?\nOnce a Nexset has been created, what are some of the ways that they can be used or further processed?\nWhat are the attributes of a Nexset? (e.g. access control policies, lineage, etc.)\n\nHow do you handle storage and sharing of a Nexset?\n\n\nWhat are some of your grand hopes and ambitions for the Nexla platform and the potential for data exchanges?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Nexla used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Nexla?\nWhen is Nexla the wrong choice?\nWhat do you have planned for the future of Nexla?\n\nContact Info\n\nSaket\n\nLinkedIn\n@saketsaurabh on Twitter\n\n\nAvinash\n\nLinkedIn\n@avinashpuri on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nNexla\nNexsets\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Nexla CEO Saket Saurabh and CTO Avinash Shahdadpuri about how they have built a platform for encapsulating your data operations in logical units that abstract away the complexity of data infrastructure.","date_published":"2021-08-25T06:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2738d2d3-32b0-4a35-9d7e-5a715d990bd2.mp3","mime_type":"audio/mpeg","size_in_bytes":48042234,"duration_in_seconds":3468}]},{"id":"podlove-2021-08-20t11:45:40+00:00-45c0764c2a622c9","title":"Let Your Analysts Build A Data Lakehouse With Cuelake","url":"https://www.dataengineeringpodcast.com/cuelake-sql-data-lakehouse-episode-214","content_text":"Summary\nData lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nHave you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. 
Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.\nYour host is Tobias Macey and today I’m interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Cuelake is and the story behind it?\nThere are a number of platforms and projects for running SQL workloads and transformations on a data lake. What was lacking in those systems that you are addressing with Cuelake?\nWho are the target users of Cuelake and how has that influenced the features and design of the system?\nCan you describe how Cuelake is implemented?\n\nWhat was your selection process for the various components?\n\n\nWhat are some of the sharp edges that you have had to work around when integrating these components?\nWhat involved in getting Cuelake deployed?\nHow are you using Cuelake in your work at Cuebook?\nGiven your focus on machine learning for anomaly detection of business metrics, what are the challenges that you faced in using a data warehouse for those workloads?\n\nWhat are the advantages that a data lake/lakehouse architecture maintains over a warehouse?\nWhat are the shortcomings of the lake/lakehouse approach that are solved by using a warehouse?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Cuelake used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Cuelake?\nWhen is Cuelake the wrong choice?\nWhat do you have planned for the future of Cuelake?\n\nContact Info\n\nLinkedIn\nvikrantcue on GitHub\n@vkrntd on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nCuelake\nApache Druid\nDremio\nDatabricks\nZeppelin\nSpark\nApache Iceberg\nApache Hudi\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.
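
As a rough illustration of the SQL-only lakehouse workflow described here (a sketch under assumed names, not Cuelake's actual implementation), Spark with the Iceberg extensions lets an analyst create and merge into lake tables entirely in SQL:

# Sketch: a SQL-driven lakehouse table on Spark + Iceberg. Catalog name,
# warehouse path, and table/column names are assumptions for illustration,
# and the iceberg-spark-runtime jar is assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sql-lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "/tmp/lake-warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        status   STRING
    ) USING iceberg
""")

# A batch of new or changed records, standing in for whatever the pipeline lands.
spark.createDataFrame(
    [(1, 10, "active"), (2, 11, "churned")],
    ["event_id", "user_id", "status"],
).createOrReplaceTempView("staging_events")

# An incremental upsert expressed entirely in SQL via MERGE, which the
# Iceberg SQL extensions enable on lake tables.
spark.sql("""
    MERGE INTO lake.db.events t
    USING staging_events s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")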

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the open source Cuelake project and how it is designed to allow data analysts to build a data lakehouse with SQL","date_published":"2021-08-20T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1c04a83a-974f-4196-b4f6-0367f583723a.mp3","mime_type":"audio/mpeg","size_in_bytes":23369841,"duration_in_seconds":1657}]},{"id":"podlove-2021-08-18t12:27:59+00:00-e98cf8afe81a9cd","title":"Migrate And Modify Your Data Platform Confidently With Compilerworks","url":"https://www.dataengineeringpodcast.com/compilerworks-data-lineage-platform-migration-episode-213","content_text":"Summary\nA major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nSchema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Compilerworks is and the story behind it?\nWhat is a compiler?\n\nHow are you applying compilers to the challenges of data processing systems?\n\n\nWhat are some use cases that Compilerworks is uniquely well suited to?\nThere are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?\nCan you describe the design and implementation of the Compilerworks platform?\n\nHow has the system changed or evolved since you first began working on it?\n\n\nWhat programming languages and SQL dialects do you currently support?\n\nWhich have been the most challenging to work with?\nHow do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?\n\n\nCan you talk through the process of getting Compilerworks integrated into a customer’s infrastructure?\n\nWhat is a typical workflow for someone using Compilerworks to manage their data lineage?\n\n\nHow does Compilerworks simplify the process of migrating between data warehouses/processing platforms?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Compilerworks used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Compilerworks?\nWhen is Compilerworks the wrong choice?\nWhat do you have planned for the future of Compilerworks?\n\nContact Info\n\n@shevek on GitHub\nWebiste\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCompilerworks\nCompiler\nANSI SQL\nSpark SQL\nGoogle Flume Paper\nSAS\nInformatica\nTrie Data Structure\nSatisfiability Solver\nLisp\nScheme\nSnooker\nQemu Java API\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.
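
As a toy illustration of the general idea of deriving lineage by parsing SQL rather than instrumenting pipelines (this uses the open source sqlglot parser purely as an example; it is not Compilerworks' compiler, and the query and table names are made up):

# Toy example: recover the tables a statement reads from and writes to by
# parsing its SQL. A real lineage compiler resolves columns, dialects, and
# whole dependency graphs; this only scratches the surface.
import sqlglot
from sqlglot import exp

query = """
    INSERT INTO reporting.daily_revenue
    SELECT order_date, SUM(amount) AS revenue
    FROM raw.orders
    JOIN raw.payments ON raw.payments.order_id = raw.orders.order_id
    GROUP BY order_date
"""

tables = sorted({t.sql() for t in sqlglot.parse_one(query).find_all(exp.Table)})
print(tables)  # the tables this statement touches, as parsed from the SQL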

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Shevek, CTO of Compilerworks, about how they are using compiler technology to aid in migrating your data processing between platforms and gain insight into your dependencies through advanced data lineage.","date_published":"2021-08-18T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/29320ccc-1cda-4c5b-a0c5-2fd009712ba2.mp3","mime_type":"audio/mpeg","size_in_bytes":55054981,"duration_in_seconds":3969}]},{"id":"podlove-2021-08-15t02:44:07+00:00-43e79a72587cacd","title":"Prepare Your Unstructured Data For Machine Learning And Computer Vision Without The Toil Using Activeloop","url":"https://www.dataengineeringpodcast.com/activeloop-unstructured-data-preparation-episode-212","content_text":"Summary\nThe vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nHave you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. 
Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.\nYour host is Tobias Macey and today I’m interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Activeloop is and the story behind it?\nHow does the form and function of data storage introduce friction in the development and deployment of machine learning projects?\nHow does the work that you are doing at Activeloop compare to vector databases such as Pinecone?\nYou have a focus on image oriented data and computer vision projects. How does the specific applications of ML/DL influence the format and interactions with the data?\nCan you describe how the Activeloop platform is architected?\n\nHow have the design and goals of the system changed or evolved since you began working on it?\n\n\nWhat are the feature and performance tradeoffs between self-managed storage locations (e.g. S3, GCS) and the Activeloop platform?\nWhat is the process for sourcing, processing, and storing data to be used by Hub/Activeloop?\nMany data assets are useful across ML/DL and analytical purposes. What are the considerations for managing the lifecycle of data between Activeloop/Hub and a data lake/warehouse?\nWhat do you see as the opportunity and effort to generalize Hub and Activeloop to support arbitrary ML frameworks/languages?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Activeloop and Hub used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Activeloop?\nWhen is Hub/Activeloop the wrong choice?\nWhat do you have planned for the future of Activeloop?\n\nContact Info\n\nLinkedIn\n@DBuniatyan on Twitter\ndavidbuniat on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nActiveloop\n\nSlack Community\n\n\nPrinceton University\nImageNet\nTensorflow\nPyTorch\n\nPodcast Episode\n\n\nActiveloop Hub\nDelta Lake\n\nPodcast Episode\n\n\nTensor\nWasabi\nRay/Anyscale\n\nPodcast Episode\n\n\nHumans In The Loop podcast\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Davit Buniatyan about his work on Activeloop and the open source Hub framework for reducing the toil involved in getting your unstructured data ready for computer vision and machine learning projects.","date_published":"2021-08-14T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/53234697-0b6e-4110-b0e7-22f1d95711f1.mp3","mime_type":"audio/mpeg","size_in_bytes":39713342,"duration_in_seconds":2919}]},{"id":"podlove-2021-08-10t11:14:40+00:00-eaebc2337aeb7c9","title":"Build Trust In Your Data By Understanding Where It Comes From And How It Is Used With Stemma","url":"https://www.dataengineeringpodcast.com/stemma-data-discovery-episode-211","content_text":"Summary\nAll of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data.\n\nInterview\n\nIntroduction\nCan you describe what Stemma is and the story behind it?\nCan you give me more context into how and why Stemma fits into the current data engineering world? Among the popular tools of today for data warehousing and other products that stitch data together – what is Stemma’s place? Where does it fit into the workflow?\nHow has the explosion in options for data cataloging and discovery influenced your thinking on the necessary feature set for that class of tools? How do you compare to your competitors\nWith how long we have been using data and building systems to analyze it, why do you think that trust in the results is still such a momentous problem?\nTell me more about Stemma and how it compares to Amundsen?\nCan you tell me more about the impact of Stemma/Amundsen to companies that use it?\nWhat are the opportunities for innovating on top of Stemma to help organizations streamline communication between data producers and consumers?\nBeyond the technological capabilities of a data platform, the bigger question is usually the social/organizational patterns around data. How have the \"best practices\" around the people side of data changed in the recent past?\n\nWhat are the points of friction that you continue to see?\n\n\nA majority of conversations around data catalogs and discovery are focused on analytical usage. How can these platforms be used in ML and AI workloads?\nHow has the data engineering world changed since you left Lyft/since we last spoke? How do you see it evolving in the future?\nImagine 5 years down the line and let’s say Stemma is a household name. How have data analysts’ lives improved? Data engineers? Data scientists?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Stemma used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Stemma?\nWhen is Stemma the wrong choice?\nWhat do you have planned for the future of Stemma?\n\nContact Info\n\nLinkedIn\nEmail\n@mark_grover on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nStemma\nAmundsen\n\nPodcast Episode\n\n\nCSAT == Customer Satisfaction\nData Mesh\n\nPodcast Episode\n\n\nFeast open source feature store\nSupergrain\nTransform\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Stemma founder and CEO Mark Grover about how it can be used to establish trust and understanding of your data and how it is being used.","date_published":"2021-08-10T07:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3dac095a-e4d5-4390-8d2a-2d767fe99051.mp3","mime_type":"audio/mpeg","size_in_bytes":44546769,"duration_in_seconds":3156}]},{"id":"podlove-2021-08-07t00:41:45+00:00-4ad67a98c57fb9d","title":"Data Discovery From Dashboards To Databases With Castor","url":"https://www.dataengineeringpodcast.com/castor-data-discovery-platform-episode-210","content_text":"Summary\nEvery organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nHave you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? 
Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.\nYour host is Tobias Macey and today I’m interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Castor is and the story behind it?\nThe market for data catalogues is nascent but growing fast. What are the broad categories for the different products and projects in the space?\nWhat do you see as the core features that are required to be competitive?\n\nIn what ways has that changed in the past 1 – 2 years?\n\n\nWhat are the opportunities for innovation and differentiation in the data catalog/discovery ecosystem?\n\nHow do you characterize your current position in the market?\n\n\nWho are the target users of Castor?\nCan you describe the technical architecture and implementation of the Castor platform?\n\nHow have the goals and design changed since you first began working on it?\n\n\nCan you talk through the workflow of getting Castor set up in an organization and onboarding the users?\nWhat are the design elements and platform features that allow for serving the various roles and stakeholders in an organization?\nWhat are the organizational benefits that you have seen from users adopting Castor or other data discovery/catalog systems?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Castor used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Castor?\nWhen is Castor the wrong choice?\nWhat do you have planned for the future of Castor?\n\nContact Info\n\nAmaury Dumoulin\nCastor website\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCastor\nAtlan\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nMonte Carlo\n\nPodcast Episode\n\n\nCollibra\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nAirflow\n\nPodcast Episode\n\n\nMetabase\n\nPodcast Episode\n\n\nAirbyte\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Castor platform approaches the problem of data discovery and preserving context for your organization.","date_published":"2021-08-06T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/be2607c1-0cb0-4b65-ae93-7b8c99b6dcab.mp3","mime_type":"audio/mpeg","size_in_bytes":44015215,"duration_in_seconds":3166}]},{"id":"podlove-2021-07-31t03:03:42+00:00-fbeef69390c166b","title":"Charting A Path For Streaming Data To Fill Your Data Lake With Hudi","url":"https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209","content_text":"Summary\nData lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. 
Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Hudi is and the story behind it?\nWhat are the use cases that it is focused on supporting?\nThere have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.?\nCan you describe how Hudi is architected?\n\nHow have the goals and design of Hudi changed or evolved since you first began working on it?\nIf you were to start the whole project over today, what would you do differently?\n\n\nCan you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment?\nOne of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?\n\nHow does Hudi make that a tractable problem?\n\n\nWhat are the data platform components that are needed to support an installation of Hudi?\nWhat is involved in migrating an existing data lake to use Hudi?\n\nHow would someone approach supporting heterogeneous table formats in their lake?\n\n\nAs someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Hudi used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi?\nWhen is Hudi the wrong choice?\nWhat do you have planned for the future of Hudi?\n\nContact Info\n\nLinkedin\nTwitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nHudi Docs\nHudi Design & Architecture\nIncremental Processing\nCDC == Change Data Capture\n\nPodcast Episodes\n\n\nOracle GoldenGate\nVoldemort\nKafka\nHadoop\nSpark\nHBase\nParquet\nIceberg Table Format\n\nData Engineering Episode\n\n\nHive ACID\nApache Kudu\n\nPodcast Episode\n\n\nVertica\nDelta Lake\n\nPodcast Episode\n\n\nOptimistic Concurrency Control\nMVCC == Multi-Version Concurrency Control\nPresto\nFlink\n\nPodcast Episode\n\n\nTrino\n\nPodcast Episode\n\n\nGobblin\nLakeFS\n\nPodcast Episode\n\n\nNessie\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview about the Hudi project and how it allows for integrating streaming data sources into analytical queries across your data lake.","date_published":"2021-08-03T06:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6c0c2e83-7dfa-4440-885d-e03f63fd41e0.mp3","mime_type":"audio/mpeg","size_in_bytes":54757986,"duration_in_seconds":4176}]},{"id":"podlove-2021-07-30t01:54:29+00:00-5c4c4807e95972b","title":"Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar","url":"https://www.dataengineeringpodcast.com/selectstar-data-discovery-platform-episode-208","content_text":"Summary\nCompanies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to founding a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Shinji Kim about SelectStar, an intelligent data discovery platform that helps you understand your data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what SelectStar is and the story behind it?\nWhat are the core challenges that organizations are facing around data cataloging and discovery?\nThere has been a surge in tools and services for metadata collection, data catalogs, and data collaboration. How would you characterize the current state of the ecosystem?\n\nWhat is SelectStar’s role in the space?\n\n\nWho are your target customers and how does that shape your prioritization of features and the user experience design?\nCan you describe how SelectStar is architected?\n\nHow have the goals and design of the platform shifted or evolved since you first began working on it?\n\n\nI understand that you have built integrations with a number of BI and dashboarding tools such as Looker, Tableau, Superset, etc. What are the use cases that those integrations enable?\n\nWhat are the challenges or complexities involved in building and maintaining those integrations?\n\n\nWhat are the other categories of integration that you have had to implement to make SelectStar a viable solution?\nCan you describe the workflow of a team that is using SelectStar to collaborate on data engineering and analytics?\nWhat have been the most complex or difficult problems to solve for?\nWhat are the most interesting, innovative, or unexpected ways that you have seen SelectStar used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SelectStar?\nWhen is SelectStar the wrong choice?\nWhat do you have planned for the future of SelectStar?\n\nContact Info\n\nLinkedIn\n@shinjikim on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSelectStar\nUniversity of Waterloo\nKafka\nStorm\nConcord Systems\nAkamai\nSnowflake\n\nPodcast Episode\n\n\nBigQuery\nLooker\n\nPodcast Episode\n\n\nTableau\ndbt\n\nPodcast Episode\n\n\nOpenLineage\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to found a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Shinji Kim about her experience building the SelectStar data discovery platform to streamline communications about data across your organization","date_published":"2021-07-30T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/26e2f7d1-59b0-4bb2-90d6-aa9d0110e6b1.mp3","mime_type":"audio/mpeg","size_in_bytes":43084376,"duration_in_seconds":3083}]},{"id":"podlove-2021-07-28t02:34:57+00:00-2b0f042208e4eba","title":"Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax","url":"https://www.dataengineeringpodcast.com/datastax-astra-streaming-managed-pulsar-episode-207","content_text":"Summary\nEveryone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and make stremaing workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. 
Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar\n\nInterview\n\n\nIntroduction\n\n\nHow did you get involved in the area of data management?\n\n\nCan you describe what the Astra platform is and the story behind it?\n\n\nHow does streaming fit into your overall product vision and the needs of your customers?\n\n\nWhat was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?\n\n\nWhat are the core use cases that you are aiming to support with Astra Streaming?\n\n\nCan you describe the architecture and automation of your hosted platform for Pulsar?\n\nWhat are the integration points that you have built to make it work well with Cassandra?\n\n\n\nWhat are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?\n\n\nWhat are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?\n\n\nWhat is the process for someone to adopt and integrate with your Astra Streaming service?\n\nHow do you handle migrating existing projects, particularly if they are using Kafka currently?\n\n\n\nOne of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. 
What are some of the supporting systems that are necessary to power that workflow?\n\nWhat are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?\n\n\n\nWhat are the ways that you are engaging with and supporting the Pulsar community?\n\nWhat are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?\n\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Astra used?\n\n\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?\n\n\nWhen is Astra the wrong choice?\n\n\nWhat do you have planned for the future of Astra?\n\n\nContact Info\n\nPrabhat\n\nLinkedIn\n@prabhatja on Twitter\nprabhatja on GitHub\n\n\nJonathan\n\nLinkedIn\n@spyced on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPulsar\n\nPodcast Episode\nStreamnative Episode\n\n\nDatastax Astra Streaming\nDatastax Astra DB\nLuna Streaming Distribution\nDatastax\nCassandra\nKesque (formerly Kafkaesque)\nKafka\nRabbitMQ\nPrometheus\nGrafana\nPulsar Heartbeat\nPulsar Summit\nPulsar Summit Presentation on Kafka Connectors\nReplicated\nChaos Engineering\nFallout chaos engineering tools\nJepsen\n\nPodcast Episode\n\n\nJack VanLightly\n\nBookKeeper TLA+ Model\n\n\nChange Data Capture\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data-intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview about the operational and architectural complexities of building a managed service of Apache Pulsar at scale for Datastax to power streaming data workloads.","date_published":"2021-07-27T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e363cc12-f3e1-4d24-80f2-278cbb67b64e.mp3","mime_type":"audio/mpeg","size_in_bytes":49447736,"duration_in_seconds":3612}]},{"id":"podlove-2021-07-22t23:01:07+00:00-714f1300e389134","title":"Bringing The Metrics Layer To The Masses With Transform","url":"https://www.dataengineeringpodcast.com/transform-co-metrics-layer-episode-206","content_text":"Summary\nCollecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. 
By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Transform is and the story behind it?\nHow do you define the concept of a \"metric\" in the context of the data platform?\nWhat are the general strategies in the industry for creating, managing, and consuming metrics?\n\nHow has that been changing in the past couple of years?\n\nWhat is driving that shift?\n\n\n\n\nWhat are the main goals that you have for the Transform platform?\n\nWho are the target users? How does that focus influence your approach to the design of the platform?\n\n\nHow is the Transform platform architected?\n\nWhat are the core capabilities that are required for a metrics service?\n\n\nWhat are the integration points for a metrics service?\nCan you talk through the workflow of defining and consuming metrics with Transform?\n\nWhat are the challenges that teams face in establishing consensus or a shared understanding around a given metric definition?\nWhat are the lifecycle stages that need to be factored into the long-term maintenance of a metric definition?\n\n\nWhat are some of the capabilities or projects that are made possible by having a metrics layer in the data platform?\nWhat are the capabilities in downstream tools that are currently missing or underdeveloped to support the metrics store as a core layer of the platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Transform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Transform?\nWhen is Transform the wrong choice?\nWhat do you have planned for the future of Transform?\n\nContact Info\n\nLinkedIn\n@nick_handel on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTransform\nTransform’s Metrics Framework\nTransform’s Metrics Catalog\nTransform’s Metrics API\nNick’s experiences using Airbnb’s Metrics Store\nGet Transform\nBlackRock\nAirBnB\nAirflow\nSuperset\n\nPodcast Episode\n\n\nAirBnB Knowledge Repo\nAirBnB Minerva Metric Store\nOLAP Cube\nSemantic Layer\nMaster Data Management\n\nPodcast Episode\n\n\nData Normalization\nOpenLineage\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Nick Handel about the benefits of a unified metrics layer for improving the confidence of your analytics and his work at Transform to make it accessible to everyone.","date_published":"2021-07-22T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/19f07981-4680-4aeb-b521-0e65d30670c2.mp3","mime_type":"audio/mpeg","size_in_bytes":53371815,"duration_in_seconds":3677}]},{"id":"podlove-2021-07-19t23:58:47+00:00-24ff79fc69b9785","title":"Strategies For Proactive Data Quality Management","url":"https://www.dataengineeringpodcast.com/datafold-proactive-data-quality-episode-205","content_text":"Summary\nData quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? 
But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Datafold and the story behind it?\nWhat are the biggest factors that you see contributing to data quality issues?\n\nHow are teams identifying and addressing those failures?\n\n\nHow does the data platform architecture impact the potential for introducing quality problems?\nWhat are some of the potential risks or consequences of introducing errors in data processing?\nHow can organizations shift to being proactive in their data quality management?\n\nHow much of a role does tooling play in addressing the introduction and remediation of data quality problems?\n\n\nCan you describe how Datafold is designed and architected to allow for proactive management of data quality?\n\nWhat are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?\n\n\nWhat is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?\nWhat are the organizational patterns that you have found to be most conducive to proactive data quality management?\n\nWho is responsible for identifying and addressing quality issues?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Datafold used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?\nWhen is Datafold the wrong choice?\nWhat do you have planned for the future of Datafold?\n\nContact Info\n\nLinkedIn\n@glebmm on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDatafold\nAutodesk\nAirflow\n\nPodcast.__init__ Episode\n\n\nSpark\nLooker\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nDagster\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nChange Data Capture\n\nPodcast Episodes\n\n\nDelta Lake\n\nPodcast Episode\n\n\nTrino\n\nPodcast Episode\n\n\nPresto\nParquet\n\nPodcast Episode\n\n\nData Quality Meetup\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\nSpecial Guest: Gleb Mezhanskiy.","content_html":"

Summary

\n

Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

Special Guest: Gleb Mezhanskiy.

","summary":"An interview with Gleb Mezhanskiy about his work at Datafold and how it has informed his strategies for proactive management of data quality across your organization.","date_published":"2021-07-19T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7368e374-4299-43d4-a576-4999f38ed668.mp3","mime_type":"audio/mpeg","size_in_bytes":47762782,"duration_in_seconds":3666}]},{"id":"podlove-2021-07-16t11:34:11+00:00-c32c7341f40a9a5","title":"Low Code And High Quality Data Engineering For The Whole Organization With Prophecy","url":"https://www.dataengineeringpodcast.com/prophecy-low-code-data-engineering-episode-204","content_text":"Summary\nThere is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context is paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI and code oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. 
Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Raj Bains about Prophecy, a low-code data engineering platform built on Spark and Airflow\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Prophecy and the story behind it?\nThere are a huge number of tools and recommended architectures for every variety of data need. Why is data engineering still such a complicated and challenging undertaking?\n\nWhat features and capabilities does Prophecy provide to help address those issues?\n\n\nWhat are the roles and use cases that you are focusing on serving with Prophecy?\nWhat are the elements of the data platform that Prophecy can replace?\nCan you describe how Prophecy is implemented?\n\nWhat was your selection criteria for the foundational elements of the platform?\nWhat would be involved in adopting other execution and orchestration engines?\n\n\nCan you describe the workflow of building a pipeline with Prophecy?\n\nWhat are the design and structural features that you have built to manage workflows as they scale in terms of technical and organizational complexity?\nWhat are the options for data engineers/data professionals to build and share reusable components across the organization?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Prophecy used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Prophecy?\nWhen is Prophecy the wrong choice?\nWhat do you have planned for the future of Prophecy?\n\nContact Info\n\nLinkedIn\n@_raj_bains on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nProphecy\nCUDA\nApache Hive\nHortonworks\nNoSQL\nNewSQL\nPaxos\nApache Impala\nAbInitio\nTeradata\nSnowflake\n\nPodcast Episode\n\n\nPresto\n\nPodcast Episode\n\n\nLinkedIn\nSpark\nDatabricks\nCron\nAirflow\nAstronomer\nAlteryx\nStreamsets\nAzure Data Factory\nApache Flink\n\nPodcast Episode\n\n\nPrefect\n\nPodcast Episode\n\n\nDagster\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nKubernetes Operator\nScala\nKafka\nAbstract Syntax Tree\nLanguage Server Protocol\nAmazon Deequ\ndbt\nTecton\n\nPodcast Episode\n\n\nInformatica\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context are paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI-first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor, Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code-first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI- and code-oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Raj Bains about how the Prophecy platform provides a smooth experience for the whole organization to build high quality data engineering workflows with a unified model that brings engineers and business users together in one experience.","date_published":"2021-07-16T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7015d8f7-8f7e-4e90-8390-b4c64fa78af2.mp3","mime_type":"audio/mpeg","size_in_bytes":66872926,"duration_in_seconds":4355}]},{"id":"podlove-2021-07-13t01:42:59+00:00-db83cb44448463e","title":"Exploring The Design And Benefits Of The Modern Data Stack","url":"https://www.dataengineeringpodcast.com/godatadriven-modern-data-stack-episode-203","content_text":"Summary\nWe have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and \"best practices\" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the \"Modern Data Stack\". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and the benefits of bringing engineers and business users together with data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. 
Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving your definition of the modern data stack?\n\nWhat are the key characteristics of a tool or platform that make it a candidate for the \"modern\" stack?\n\n\nHow does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?\nWhat are some difficulties that you face when working with customers to migrate to these new architectures?\nWhat are some of the limitations of the components or paradigms of the modern stack?\n\nWhat are some strategies that you have devised for addressing those limitations?\nWhat are some edge cases that you have run up against with specific vendors that you have had to work around?\nWhat are the \"gotchas\" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?\n\n\nHow does data governance get applied across the various services and systems of the modern stack?\nOne of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?\nWhat is the role of data engineers in the context of the \"modern\" stack?\nWhat are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?\nWhen is the modern data stack the wrong choice?\nWhat new architectures or tools are you keeping an eye on for future client work?\n\nContact Info\n\nGuillermo\n\nLinkedIn\nguillesd on GitHub\n\n\nBram\n\nLinkedIn\nbramochsendorf on GitHub\n\n\nJuan\n\nLinkedIn\njmperafan on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nGoDataDriven\nDeloitte\nRPA == Robotic Process Automation\nAnalytics Engineer\nJames Webb Space Telescope\nFivetran\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nData Governance\n\nPodcast Episodes\n\n\nAzure Cloud Platform\nStitch Data\nAirflow\nPrefect\nArgo Project\nLooker\nAzure Purview\nSoda Data\n\nPodcast Episode\n\n\nDatafold\nMaterialize\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and the benefits of bringing engineers and business users together with data.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"A conversation about the design and motivation of the "modern data stack" and how it can simplify the work of building a self-service data platform that enables everyone in the business to ask and answer questions with data.","date_published":"2021-07-12T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bfd208e0-951b-4b10-84df-a6aa0bd35c12.mp3","mime_type":"audio/mpeg","size_in_bytes":32372294,"duration_in_seconds":2941}]},{"id":"podlove-2021-07-09t22:31:41+00:00-4e92b1dbf0c14d1","title":"Democratize Data Cleaning Across Your Organization With Trifacta","url":"https://www.dataengineeringpodcast.com/trifacta-data-cleaning-episode-202","content_text":"Summary\nEvery data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Trifacta is and the story behind it?\nAcross your site and material you focus on using the term \"data wrangling\". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?\n\nHow does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform?\n\n\nWhat is Trifacta’s role in the overall data platform/data lifecycle for an organization?\n\nWhat are some examples of tools that Trifacta might replace?\nWhat tools or systems does Trifacta integrate with?\n\n\nWho are the target end-users of the Trifacta platform and how do those personas direct the design and functionality?\nCan you describe how Trifacta is architected?\n\nHow have the goals and design of the system changed or evolved since you first began working on it?\n\n\nCan you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it?\nHow can data engineers share and encourage proper patterns for working with data assets with end-users across the organization?\nWhat are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools?\n\nWhat are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Trifacta used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacata?\nWhen is Trifacta the wrong choice?\nWhat do you have planned for the future of Trifacta?\n\nContact Info\n\nLinkedIn\n@a_adam_wilson on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nTrifacta\nInformatica\nUC Berkeley\nStanford University\nCitadel\n\nPodcast Episode\n\n\nStanford Data Wrangler\nDBT\n\nPodcast Episode\n\n\nPig\nDatabricks\nSqoop\nFlume\nSPSS\nTableau\nSDLC == Software Delivery Life-Cycle\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.
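
For a sense of the kind of hand-written cleanup that a visual wrangling tool aims to make unnecessary, here is a minimal pandas sketch; the column names and sample rows are purely hypothetical.

```python
# A minimal pandas sketch of hand-written data cleanup: normalize headers,
# trim whitespace, parse dates, drop duplicates, fill missing values.
# The column names and sample rows are hypothetical.
import io

import pandas as pd

raw = io.StringIO(
    "Customer Name, signup date ,Amount\n"
    "  Alice ,2021-06-01,10.5\n"
    "Bob,2021-06-02,\n"
    "Bob,2021-06-02,\n"
)

df = pd.read_csv(raw)

# Normalize messy column headers to snake_case.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Trim stray whitespace from string columns and parse dates.
df["customer_name"] = df["customer_name"].str.strip()
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Drop exact duplicates and fill missing amounts with zero.
df = df.drop_duplicates().fillna({"amount": 0.0})

print(df)
```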

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Trifacta CEO Adam Wilson about how the platform is used to democratize data cleaning for everyone in the organization","date_published":"2021-07-09T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1551b3b8-28f3-45a4-8ec1-9fa913d4b988.mp3","mime_type":"audio/mpeg","size_in_bytes":52350189,"duration_in_seconds":4033}]},{"id":"podlove-2021-07-05t23:01:22+00:00-c50ce2ae6f378d0","title":"Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager","url":"https://www.dataengineeringpodcast.com/saasglue-cloud-workflow-manager-episode-201","content_text":"Summary\nAt the core of every data pipeline is an workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, and spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Rich and Bart Wood about SaasGlue, a SaaS-based integration, orchestration and automation platform that lets you fill the gaps in your existing automation infrastructure\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what SaasGlue is and the story behind it?\n\nI understand that you are building this company with your 3 brothers. What have been the pros and cons of working with your family on this project?\n\n\nWhat are the main use cases that you are focused on enabling?\n\nWho are your target users and how has that influenced the features and design of the platform?\n\n\nOrchestration, automation, and workflow management are all areas that have a range of active products and projects. How do you characterize SaaSGlue’s position in the overall ecosystem?\n\nWhat are some of the ways that you see it integrated into a data platform?\n\n\nWhat are the core elements and concepts of the SaaSGlue platform?\nHow is the SaaSGlue platform architected?\n\nHow have the goals and design of the platform changed or evolved since you first began working on it?\nWhat are some of the assumptions that you had at the beginning of the project which have been challenged or changed as you worked through building it?\n\n\nCan you talk through the workflow of someone building a task graph with SaaSGlue?\nHow do you handle dependency management for custom code in the payloads for agent tasks?\nHow does SaasGlue manage metadata propagation throughout the execution graph?\nHow do you handle the myriad failure modes that you are likely to encounter? (e.g. agent failure, network partitions, individual task failures, etc.)\nWhat are some of the tools/platforms/architectural paradigms that you looked to for inspiration while designing and building SaaSGlue?\nWhat are the most interesting, innovative, or unexpected ways that you have seen SaasGlue used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on SaasGlue?\nWhen is SaaSGlue the wrong choice?\nWhat do you have planned for the future of SaaSGlue?\n\nContact Info\n\nRich\n\nLinkedIn\n\n\nBart\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nSaaSGlue\nJenkins\nCron\nAirflow\nAnsible\nTerraform\nDSL == Domain Specific Language\nClojure\nGradle\nPolymorphism\nDagster\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nMartin Kleppman\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems across clouds and physical infrastructure, spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.
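
As a rough illustration of the task-graph idea that any workflow manager is built around (and not a description of SaaSGlue's own model), here is a minimal Python sketch with hypothetical task names.

```python
# A minimal sketch of the task-graph idea at the heart of a workflow manager:
# declare dependencies, then execute tasks in topological order.
# Illustrative only; the task names are hypothetical.
from graphlib import TopologicalSorter

def extract():
    print("pulling data from the source API")

def transform():
    print("cleaning and reshaping the extracted data")

def load():
    print("loading the results into the warehouse")

# Map each task to the set of tasks it depends on.
dag = {extract: set(), transform: {extract}, load: {transform}}

for task in TopologicalSorter(dag).static_order():
    task()
```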

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the how the SaaSGlue workflow manager simplifies the process of sticking together all of your clouds, services, and data pipelines","date_published":"2021-07-05T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/36e0fb17-44d7-4214-83d0-2b24b0d6012b.mp3","mime_type":"audio/mpeg","size_in_bytes":40874662,"duration_in_seconds":3331}]},{"id":"podlove-2021-07-02t11:58:44+00:00-b79177246426f2e","title":"Leveling Up Open Source Data Integration With Meltano Hub And The Singer SDK","url":"https://www.dataengineeringpodcast.com/meltano-singer-data-integration-improvements-episode-200","content_text":"Summary\nData integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what the Singer ecosystem is?\nWhat are the current weak points/challenges in the ecosystem?\nWhat is the current role of the Meltano project/community within the ecosystem?\n\nWhat are the projects and activities related to Singer that you are focused on?\n\n\nWhat are the main goals of the Meltano Hub?\n\nWhat criteria are you using to determine which projects to include in the hub?\nWhy is the number of targets so small?\nWhat additional functionality do you have planned for the hub?\n\n\nWhat functionality does the SDK provide?\n\nHow does the presence of the SDK make it easier to write taps/targets?\nWhat do you believe the long-term impacts of the SDK on the overall availability and quality of plugins will be?\n\n\nNow that you have spun out your own business and raised funding, how does that influence the priorities and focus of your work?\n\nHow do you hope to productize what you have built at Meltano?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Meltano and Singer plugins used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with the Singer community and the Meltano project?\nWhen is Singer/Meltano the wrong choice?\nWhat do you have planned for the future of Meltano, Meltano Hub, and the Singer SDK?\n\nContact Info\n\nDouwe\n\nWebsite\n\n\nTaylor\n\nLinkedIn\n@tayloramurphy on Twitter\nBlog\n\n\nAJ\n\nLinkedIn\n@aaronsteers on Twitter\naaronsteers on GitLab\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSinger\nMeltano\n\nPodcast Episode\n\n\nMeltano Hub\nSinger SDK\nConcert Genetics\nGitLab\nSnowflake\ndbt\n\nPodcast Episode\n\n\nMicrosoft SQL Server\nAirflow\n\nPodcast Episode\n\n\nDagster\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nPrefect\n\nPodcast Episode\n\n\nAWS Athena\nReverse ETL\nREST (REpresentational State Transfer)\nGraphQL\nMeltano Interpretation of Singer Specification\nVision for the Future of Meltano blog post\nCoalesce Conference\nRunning Your Data Team Like A Product Team\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.
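
For context on what a tap produces, here is a hand-rolled sketch of the Singer message flow: a SCHEMA describing the stream, RECORD messages for each row, and a STATE message for incremental bookmarks, all emitted as newline-delimited JSON on stdout. The stream name and fields are hypothetical, and the Singer SDK discussed here exists to generate this boilerplate for you.

```python
# A hand-rolled sketch of the Singer message flow that a tap emits on stdout.
# The stream and fields are hypothetical.
import json
import sys

def emit(message: dict) -> None:
    # Singer messages are newline-delimited JSON written to stdout.
    sys.stdout.write(json.dumps(message) + "\n")

emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
    "key_properties": ["id"],
})

for row in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": row})

# A bookmark so the next run can resume where this one left off.
emit({"type": "STATE", "value": {"users": {"max_id": 2}}})
```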

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the Meltano team about how they are investing in the Singer ecosystem for data integration with the Meltano Hub and Singer SDK","date_published":"2021-07-02T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c359440e-7359-4099-9565-ddfad0fae6ae.mp3","mime_type":"audio/mpeg","size_in_bytes":50447363,"duration_in_seconds":3924}]},{"id":"podlove-2021-06-28t22:12:30+00:00-15be1a0bfc4e2e1","title":"A Candid Exploration Of Timeseries Data Analysis With InfluxDB","url":"https://www.dataengineeringpodcast.com/influxdb-timeseries-data-platform-episode-199","content_text":"Summary\nWhile the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Paul Dix about Influx Data and the different facets of the market for timeseries databases\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Influx Data and the story behind it?\nTimeseries data is a fairly broad category with many variations in terms of storage volume, frequency, processing requirements, etc. This has led to an explosion of database engines and related tools to address these different needs. How do you think about your position and role in the ecosystem?\n\nWho are your target customers and how does that focus inform your product and feature priorities?\nWhat are the use cases that Influx is best suited for?\n\n\nCan you give an overview of the different projects, tools, and services that comprise your platform?\nHow is InfluxDB architected?\n\nHow have the design and implementation of the DB engine changed or evolved since you first began working on it?\nWhat are you optimizing for on the consistency vs. availability spectrum of CAP?\nWhat is your approach to clustering/data distribution beyond a single node?\n\n\nFor the interface to your database engine you developed a custom query language. What was your process for deciding what syntax to use and how to structure the programmatic interface?\nHow do you handle the lifecycle of data in an Influx deployment? (e.g. aging out old data, periodic compaction/rollups, etc.)\nWith your strong focus on monitoring use cases, how do you handle the challenge of high cardinality in the data being stored?\nWhat are some of the data modeling considerations that users should be aware of as they are designing a deployment of Influx?\nWhat is the role of open source in your product strategy?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the Influx platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Influx?\nWhen is Influx DB and/or the associated tools the wrong choice?\nWhat do you have planned for the future of Influx Data?\n\nContact Info\n\nLinkedIn\npauldix on GitHub\n@pauldix on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nInflux Data\nInflux DB\nSearch and Information Retrieval\nDatadog\n\nPodcast Episode\n\n\nNew Relic\nStackDriver\nScala\nCassandra\nRedis\nKDB\nLatent Semantic Indexing\nTICK Stack\nELK Stack\nPrometheus\nTSM storage engine\nTSI Storage Engine\nGolang\nRust Language\nRAFT Protocol\nTelegraf\nKafka\nInfluxQL\nFlux Language\nDataFusion\nApache Arrow\nApache Parquet\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.
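
As a quick illustration of why cardinality matters for monitoring workloads, here is a small sketch that builds InfluxDB line-protocol points and counts the distinct series they create; the host and region values are hypothetical.

```python
# A small sketch of InfluxDB's line protocol and why tag cardinality matters:
# each distinct combination of measurement and tag values is its own series,
# so high-cardinality tags multiply the series count. Values are hypothetical.
import time
from itertools import product

hosts = [f"host-{i}" for i in range(3)]
regions = ["us-east", "eu-west"]

points = []
for host, region in product(hosts, regions):
    ts = time.time_ns()
    # measurement,tag=value,... field=value,... timestamp
    points.append(f"cpu,host={host},region={region} usage_user=42.5 {ts}")

print("\n".join(points))
# Series cardinality is the number of unique tag combinations per measurement.
print("series cardinality:", len(hosts) * len(regions))
```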

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Paul Dix about the technology that powers the Influx Data platform and the architectural evolution that keeps delivering better performance for your timeseries data","date_published":"2021-06-28T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/22f7aebc-2769-414d-ae05-9cccc51e6015.mp3","mime_type":"audio/mpeg","size_in_bytes":50989436,"duration_in_seconds":3962}]},{"id":"podlove-2021-06-26t01:47:15+00:00-ea0a8e358252538","title":"Lessons Learned From The Pipeline Data Engineering Academy","url":"https://www.dataengineeringpodcast.com/pipeline-data-engineering-academy-retrospective-episode-198","content_text":"Summary\nData Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Daniel Molnar and Peter Fabian about the lessons that they learned from their first cohort at the Pipeline data engineering academy\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing the curriculum and learning goals for the students?\nHow did you set a common baseline for all of the students to build from throughout the program?\n\nWhat was your process for determining the structure of the tasks and the tooling used?\n\n\nWhat were some of the topics/tools that the students had the most difficulty with?\n\nWhat topics/tools were the easiest to grasp?\n\n\nWhat are some difficulties that you encountered while trying to teach different concepts?\nHow did you deal with the tension of teaching the fundamentals while tying them to toolchains that hiring managers are looking for?\nWhat are the successes that you had with this cohort and what changes are you making to your approach/curriculum to build on them?\nWhat are some of the failures that you encountered and what lessons have you taken from them?\nHow did the pandemic impact your overall plan and execution of the initial cohort?\nWhat were the skills that you focused on for interview preparation?\nWhat level of ongoing support/engagement do you have with students once they complete the curriculum?\nWhat are the most interesting, innovative, or unexpected solutions that you saw from your students?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working with your first cohort?\nWhen is a bootcamp the wrong approach for skill development?\nWhat do you have planned for the future of the Pipeline Academy?\n\nContact Info\n\nDaniel\n\nLinkedIn\nWebsite\n@soobrosa on Twitter\n\n\nPeter\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPipeline Academy\n\nBlog\n\n\nScikit\nPandas\nUrchin\nKafka\nThree \"C\"s – Context, Confidence, and Code\nPrefect\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nDocker\nKubernetes\nBecome a Data Engineer On A Shoestring\nJames Mickens\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the co-founders of the Pipeline Data Engineering Academy about the lessons that they learned along with their first cohort of students.","date_published":"2021-06-25T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/eb9be557-71f3-4d25-8c4b-7aad409e18f7.mp3","mime_type":"audio/mpeg","size_in_bytes":52606599,"duration_in_seconds":4263}]},{"id":"podlove-2021-06-23t02:28:07+00:00-4d0d8130d79c28b","title":"Make Database Performance Optimization A Playful Experience With OtterTune","url":"https://www.dataengineeringpodcast.com/ottertune-database-performance-optimization-episode-197","content_text":"Summary\nThe database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Andy Pavlo about OtterTune, a system to continuously monitor and improve database performance via machine learning\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what OtterTune is and the story behind it?\n\nHow does it relate to your work with NoisePage?\n\n\nWhat are the challenges that database administrators, operators, and users run into when working with, configuring, and tuning transactional systems?\n\nWhat are some of the contributing factors to the sprawling complexity of the configurable parameters for these databases?\n\n\nCan you describe how OtterTune is implemented?\n\nWhat are some of the aggregate benefits that OtterTune can gain by running as a centralized service and learning from all of the systems that it connects to?\nWhat are some of the assumptions that you made when starting the commercialization of this technology that have been challenged or invalidated as you began working with initial customers?\nHow have the design and goals of the system changed or evolved since you first began working on it?\n\n\nWhat is involved in adding support for a new database engine?\n\nHow applicable are the OtterTune capabilities to analytical database engines?\n\n\nHow do you handle tuning for variable or evolving workloads?\nWhat are some of the most interesting or esoteric configuration options that you have come across while working on OtterTune?\n\nWhat are some that made you facepalm?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen OtterTune used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on OtterTune?\nWhen is OtterTune the wrong choice?\nWhat do you have planned for the future of OtterTune?\n\nContact Info\n\nCMU Page\napavlo on GitHub\n@andy_pavlo on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nOtterTune\nCMU (Carnegie Mellon University)\nBrown University\nMichael Stonebraker\nH-Store\nLearned Indexes\nNoisePage\nOracle DB\nPostgreSQL\n\nPodcast Episode\n\n\nMySQL\nRDS\nGaussian Process Model\nReinforcement Learning\nAWS Aurora\nMVCC (Multi-Version Concurrency Control)\nPuppet\nVectorWise\nGreenPlum\nSnowflake\n\nPodcast Episode\n\n\nPGTune\nMySQL Tuner\nSIGMOD\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.
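
As an illustrative sketch of the underlying idea (not OtterTune's implementation), the snippet below tunes two hypothetical knobs with a Gaussian Process surrogate and picks the next configuration to benchmark by upper confidence bound; the knobs and the synthetic benchmark function are stand-ins.

```python
# Illustrative knob tuning with a Gaussian Process surrogate: benchmark a few
# configurations, fit the model, then choose the next configuration to try.
# The knobs and the synthetic benchmark function are hypothetical.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def run_benchmark(shared_buffers_gb, work_mem_mb):
    # Stand-in for actually running a workload and measuring throughput.
    return -(shared_buffers_gb - 8) ** 2 - 0.01 * (work_mem_mb - 64) ** 2

# A handful of observed configurations and their measured throughput.
observed = rng.uniform([1, 4], [16, 256], size=(5, 2))
throughput = np.array([run_benchmark(*cfg) for cfg in observed])

gp = GaussianProcessRegressor(normalize_y=True).fit(observed, throughput)

# Score a grid of candidate configurations by upper confidence bound.
candidates = np.array([[b, w] for b in range(1, 17) for w in range(4, 257, 4)])
mean, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmax(mean + 1.0 * std)]
print("next configuration to benchmark:", best)
```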

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Andy Pavlo about his work on OtterTune to automatically tune your database configuration for better performance.","date_published":"2021-06-22T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/345be12f-e7f1-4de9-adba-592e00aeb3ac.mp3","mime_type":"audio/mpeg","size_in_bytes":44328942,"duration_in_seconds":3508}]},{"id":"podlove-2021-06-18t00:59:28+00:00-1ba5140e4f91379","title":"Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk","url":"https://www.dataengineeringpodcast.com/unstruk-unstructured-data-warehouse-episode-196","content_text":"Summary\nWorking with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that ofers automated data preparation via metadata enrichment, integrated compute, and graph-based search\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Unstruk Data is and the story behind it?\nWhat would you classify as \"unstructured data\"?\n\nWhat are some examples of industries that rely on large or varied sets of unstructured data?\nWhat are the challenges for analytics that are posed by the different categories of unstructured data?\n\n\nWhat is the current state of the industry for working with unstructured data?\n\nWhat are the unique capabilities that Unstruk provides and how does it integrate with the rest of the ecosystem?\nWhere does it sit in the overall landscape of data tools?\n\n\nCan you describe how the Unstruk data warehouse is implemented?\n\nWhat are the assumptions that you had at the start of this project that have been challenged as you started working through the technical implementation and customer trials?\nHow has the design and architecture evolved or changed since you began working on it?\n\n\nHow do you handle versioning of data, given the potential for individual files to be quite large?\nWhat are some of the considerations that users should have in mind when modeling their data in the warehouse?\nCan you talk through the workflow of ingesting and analyzing data with Unstruk?\n\nHow do you manage data enrichment/integration with structured data sources?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen the technology of Unstruk used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on and with the Unstruk platform?\nWhen is Unstruk the wrong choice?\nWhat do you have planned for the future of Unstruk?\n\nContact Info\n\nLinkedIn\n@KirkMarple on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nUnstruk Data\nTIFF\nROSBag\nHDF5\nMedia/Digital Asset Management\nData Mesh\nSAN\nNAS\nKnowledge Graph\nEntity Extraction\nOCR (Optical Character Recognition)\nCloud Native\nCosmos DB\nAzure Functions\nAzure EventHub\nAzure Cognitive Search\nGraphQL\nKNative\nSchema.org\nPinecone Vector Database\n\nPodcast Episode\n\n\nDublin Core Metadata Initiative\nKnowledge Management\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.
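
As a toy illustration of the general pattern (not Unstruk's architecture), the sketch below extracts basic metadata from local files and links assets to shared facets in a graph that can then be queried; the file paths and facet names are hypothetical.

```python
# A toy sketch of the general pattern: pull metadata out of unstructured files
# and connect assets to shared facets in a searchable graph.
# Not Unstruk's architecture; paths and facets are hypothetical.
import mimetypes
from pathlib import Path

import networkx as nx

graph = nx.Graph()

for path in Path(".").glob("*"):
    if not path.is_file():
        continue
    mime, _ = mimetypes.guess_type(path.name)
    stat = path.stat()
    # Each asset becomes a node carrying its extracted metadata.
    graph.add_node(path.name, size=stat.st_size, mime=mime or "unknown")
    # Link assets to shared attributes so they become searchable by facet.
    graph.add_edge(path.name, f"type:{mime or 'unknown'}")

# "Search": find every asset connected to a given facet node.
images = list(graph.neighbors("type:image/jpeg")) if "type:image/jpeg" in graph else []
print(images)
```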

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Unstruk Data platform and how it automatically extracts metadata from unstructured data in order to build a searchable graph of information about your assets.","date_published":"2021-06-17T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e57c5551-4661-4370-845b-7461f0734568.mp3","mime_type":"audio/mpeg","size_in_bytes":31188326,"duration_in_seconds":2447}]},{"id":"podlove-2021-06-15t01:47:54+00:00-69bff49e2a7af6d","title":"Accelerating ML Training And Delivery With In-Database Machine Learning","url":"https://www.dataengineeringpodcast.com/in-database-machine-learning-episode-195","content_text":"Summary\nWhen you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the current state of the market for databases that support in-process machine learning?\n\nWhat are the motivating factors for running a machine learning workflow inside the database?\n\n\nWhat styles of ML are feasible to do inside the database? (e.g. bayesian inference, deep learning, etc.)\nWhat are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts)\nCan you describe the architecture of how the machine learning process is managed by the database engine?\nHow do you manage interacting with Python/R/Jupyter/etc. when working within the database?\nWhat is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow?\nWhat are the most interesting, innovative, or unexpected ways that you have seen in-database ML used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database?\nWhen is in-database ML the wrong choice?\nWhat are the recent trends/changes in machine learning for the database that you are excited for?\n\nContact Info\n\nLinkedIn\nBlog\n@RobertsPaige on Twitter\n@PaigeEwing on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nVertica\nSyncSort\nHortonworks\nInfoworld – 8 databases supporting in-database machine learning\nPower BI\n\nPodcast Episode\n\n\nGrafana\nTableau\nK-Means Clustering\nMPP == Massively Parallel Processing\nAutoML\nRandom Forest\nPMML == Predictive Model Markup Language\nSVM == Support Vector Machine\nNaive Bayes\nXGBoost\nPytorch\nTensorflow\nNeural Magic\nTensorflow Frozen Graph\nParquet\nORC\nAvro\nCNCF == Cloud Native Computing Foundation\nHotel California\nVerticaPy\nPandas\n\nPodcast.__init__ Episode\n\n\nJupyter Notebook\nUDX\nUnifying Analytics Presentation\nHadoop\nYarn\nHolden Karau\nSpark\nVertica Academy\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.
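
To make the "don't move the data" argument concrete, here is a minimal sketch of the general idea, using sqlite3 for self-containment rather than Vertica's own in-database ML functions: the database computes the sufficient statistics with one aggregate query, and only those come back to the client, which derives the regression coefficients; table and column names are hypothetical.

```python
# Push the heavy scan down as SQL aggregates and pull back only sufficient
# statistics, instead of exporting every row to train a simple regression.
# Uses sqlite3 for self-containment; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)],
)

# One aggregate query instead of shipping the whole table to the client.
n, sx, sy, sxy, sxx = conn.execute(
    "SELECT COUNT(*), SUM(ad_spend), SUM(revenue), "
    "SUM(ad_spend * revenue), SUM(ad_spend * ad_spend) FROM sales"
).fetchone()

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(f"revenue ~ {intercept:.2f} + {slope:.2f} * ad_spend")
```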

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the benefits of in-database machine learning for building and serving your models, and how Vertica is integrating those capabilities into their product.","date_published":"2021-06-14T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ca7f5660-5c93-4339-aed0-4df6f8f4b699.mp3","mime_type":"audio/mpeg","size_in_bytes":45623614,"duration_in_seconds":3932}]},{"id":"podlove-2021-06-12t01:13:22+00:00-b289ab4bbcdaf31","title":"Taking A Tour Of The Google Cloud Platform For Data And Analytics","url":"https://www.dataengineeringpodcast.com/google-cloud-platform-data-analytics-episode-194","content_text":"Summary\nGoogle pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics?\n\nHow do the various systems relate to each other for building a full workflow?\nHow do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform?\n\n\nWhat have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads?\nWhat are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers?\nWhat are the systems that you have found to be easiest to work with?\n\nWhich are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design?\n\n\nHow has your work with customers fed back into the products that you are building on top of?\nWhat are some examples of architectural or software patterns that are unique to the GCP product suite?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Google Cloud’s data and analytics services used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working at Google and helping customers succeed in their data and analytics efforts?\nWhat are some of the new capabilities, new services, or industry trends that you are most excited for?\n\nContact Info\n\nLinkedIn\n@lak_gcp on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nGoogle Cloud\n\nData and Analytics Services\n\n\nForrester Wave\nDremel\nBigQuery\nMapReduce\nCloud Spanner\n\nSpanner Paper\n\n\nHadoop\nTensorflow\nGoogle Cloud SQL\nApache Spark\nDataproc\nDataflow\nApache Beam\nDatabricks\nMixpanel\nAvalanche data warehouse\nKubernetes\nGKE (Google Kubernetes Engine)\nGoogle Cloud Run\nAndroid\nYoutube\nGoogle Translate\nTeradata\nPower BI\n\nPodcast Episode\n\n\nAI Platform Notebooks\nGitHub Data Repository\nStack Overflow Questions Data Repository\nPyPI Download Statistics\nRecommendations AI\nPub/Sub\nBigtable\nDatastream\nChange Data Capture\n\nPodcast Episode About Debezium for CDC\nPodcast Episode About CDC with Datacoral\n\n\nDocument AI\nGoogle Meet\nData Governance\n\nPodcast Episodes\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the data and analytics services available on the Google Cloud Platform and how they can be combined to simplify your workflows.","date_published":"2021-06-11T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6af933f4-f2a5-42d2-a013-470962a37e34.mp3","mime_type":"audio/mpeg","size_in_bytes":39083604,"duration_in_seconds":3196}]},{"id":"podlove-2021-06-09t01:19:09+00:00-e6559e96fd66336","title":"Make Sure Your Records Are Reliable With The BookKeeper Distributed Storage Layer","url":"https://www.dataengineeringpodcast.com/bookkeeper-fast-distributed-storage-episode-193","content_text":"Summary\nThe way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn’t have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the various ways that it is being used today, and the architectural aspects that make it such a strong building block for projects such as Pulsar. He also shares some of the other interesting systems that have been built on top of it and an amusing war story of running it at scale in its early years.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWe’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. 
Go to dataengineeringpodcast.com/census today to get a free 14-day trial.\nYour host is Tobias Macey and today I’m interviewing Matteo Merli about Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what BookKeeper is and the story behind it?\nWhat are the most notable features/capabilities of BookKeeper?\nWhat are some of the ways that BookKeeper is being used?\nHow has your work on Pulsar influenced the features and product direction of BookKeeper?\nCan you describe the architecture of a BookKeeper cluster?\n\nHow have the design and goals of BookKeeper changed or evolved over time?\n\n\nWhat is the impact of record-oriented storage on data distribution/allocation within the cluster when working with variable record sizes?\nWhat are some of the operational considerations that users should be aware of?\nWhat are some of the most interesting/compelling features from your perspective?\nWhat are some of the most often overlooked or misunderstood capabilities of BookKeeper?\nWhat are the most interesting, innovative, or unexpected ways that you have seen BookKeeper used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on BookKeeper?\nWhen is BookKeeper the wrong choice?\nWhat do you have planned for the future of BookKeeper?\n\nContact Info\n\nLinkedIn\n@merlimat on Twitter\nmerlimat on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nApache BookKeeper\nApache Pulsar\n\nPodcast Episode\n\n\nStreamNative\n\nPodcast Episode\n\n\nHadoop NameNode\nApache Zookeeper\n\nPodcast Episode\n\n\nActiveMQ\nWrite Ahead Log (WAL)\nBookKeeper Architecture\nRocksDB\nLSM == Log-Structured Merge-Tree\nRAID Controller\nPravega\n\nPodcast Episode\n\n\nBookKeeper etcd Metadata Storage\nLevelDB\nCeph\n\nPodcast Episode\n\n\nDirect IO\nPage Cache\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn’t have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the various ways that it is being used today, and the architectural aspects that make it such a strong building block for projects such as Pulsar. He also shares some of the other interesting systems that have been built on top of it and an amusing war story of running it at scale in its early years.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the BookKeeper project for fast and reliable distributed storage that scales up and down with your workloads and how it is being used for systems like Pulsar","date_published":"2021-06-08T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b08bd31d-3711-4546-ac63-7d0ebc0e120e.mp3","mime_type":"audio/mpeg","size_in_bytes":27520935,"duration_in_seconds":2521}]},{"id":"podlove-2021-06-03t23:08:16+00:00-b450dd6ce4c9d83","title":"Build Your Analytics With A Collaborative And Expressive SQL IDE Using Querybook","url":"https://www.dataengineeringpodcast.com/querybook-big-data-sql-ide-episode-192","content_text":"Summary\nSQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nFirebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Querybook is and the story behind it?\nWhat are the main use cases or workflows that Querybook is designed for?\n\nWhat are the shortcomings of dashboarding/BI tools that make something like Querybook necessary?\n\n\nThe tag line calls out the fact that Querybook is an IDE for \"big data\". What are the manifestations of that focus in the feature set and user experience?\nWho are the target users of Querybook and how does that inform the feature priorities and user experience?\nCan you describe how Querybook is architected?\n\nHow have the goals and design changed or evolved since you first began working on it?\nWhat were some of the assumptions or design choices that you had to unwind in the process of open sourcing it?\n\n\nWhat is the workflow for someone building a DataDoc with Querybook?\n\nWhat is the experience of working as a collaborator on an analysis?\n\n\nHow do you handle lifecycle management of query results?\nWhat are your thoughts on the potential for extending Querybook beyond SQL-oriented analysis and integrating something like Jupyter kernels?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Querybook used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Querybook?\nWhen is Querybook the wrong choice?\nWhat do you have planned for the future of Querybook?\n\nContact Info\n\nJustin\n\nLinkedIn\nWebsite\n\n\nCharlie\n\nczgu on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nQuerybook\n\nAnnouncing Querybook as Open Source\n\n\nPinterest\nUniversity of Waterloo\nSuperset\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nSequel Pro\nPresto\nTrino\n\nPodcast Episode\n\n\nFlask\nuWSGI\n\nPodcast.__init__ Episode\n\n\nCelery\nRedis\nSocketIO\nElasticsearch\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nApache Atlas\nDataHub\n\nPodcast Episode\n\n\nOkta\nLDAP (Lightweight Directory Access Protocol)\nGrand Rounds\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Querybook SQL IDE for big data analytics and how you can use it to build more expressive and maintainable analytics.","date_published":"2021-06-03T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bddc9435-3d47-4dcc-8a3b-74f2f2f5908f.mp3","mime_type":"audio/mpeg","size_in_bytes":37052328,"duration_in_seconds":3155}]},{"id":"podlove-2021-05-31t23:03:02+00:00-ed10e76c3673d96","title":"Making Data Pipelines Self-Serve For Everyone With Shipyard","url":"https://www.dataengineeringpodcast.com/shipyard-self-serve-data-pipelines-episode-191","content_text":"Summary\nEvery part of the business relies on data, yet only a small team has the context and expertise to build and maintain workflows and data pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers you need a way for everyone to get access to the information they care about. To help make that a more tractable problem Blake Burch co-founded Shipyard. In this episode he explains the utility of a low code solution that lets non engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWhen it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. 
And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.\nYour host is Tobias Macey and today I’m interviewing Blake Burch about Shipyard, and his mission to create the easiest way for data teams to launch, monitor, and share resilient pipelines with less engineering\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Shipyard and the story behind it?\nWhat are the main goals that you have for Shipyard?\n\nHow does it compare to other data orchestration frameworks in the market?\n\n\nWho are the target users of Shipyard and how does that influence the features and design of the product?\n\nWhat are your thoughts on the role of data orchestration in the business?\n\n\nHow is the Shipyard platform implemented?\n\nWhat was your process for identifying the core requirements of the platform?\nHow have the design and goals of the system evolved since you first began working on it?\n\n\nCan you describe the workflow of building a data workflow with Shipyard?\n\nHow do you manage the dependency chain across tasks in the execution graph? (e.g. task-based, data assets, etc.)\n\n\nHow do you handle testing and data quality management in your workflows?\nWhat is the interface for creating custom task definitions?\n\nHow do you address dependencies and sandboxing for custom code?\n\n\nWhat is your approach to developing templates?\nWhat are the operational challenges that you have had to address to manage scaling and multi-tenancy in your platform?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Shipyard used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Shipyard?\nWhen is Shipyard the wrong choice?\nWhat do you have planned for the future of Shipyard?\n\nContact Info\n\nLinkedIn\n@BlakeBurch_ on Twitter\nWebsite\nblakeburch on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nShipyard\nZapier\nAirtable\nBigQuery\nSnowflake\n\nPodcast Episode\n\n\nDocker\nECS == Elastic Container Service\nGreat Expectations\n\nPodcast Episode\n\n\nMonte Carlo\n\nPodcast Episode\n\n\nSoda Data\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Every part of the business relies on data, yet only a small team has the context and expertise to build and maintain workflows and data pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers you need a way for everyone to get access to the information they care about. To help make that a more tractable problem Blake Burch co-founded Shipyard. In this episode he explains the utility of a low-code solution that lets non-engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Shipyard platform is designed to make data pipelines more accessible by everyone in the business with a graphical approach to wiring together reusable processing steps.","date_published":"2021-06-01T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/62e4e49b-2826-47ea-8021-0ebf13367efc.mp3","mime_type":"audio/mpeg","size_in_bytes":43774946,"duration_in_seconds":3082}]},{"id":"podlove-2021-05-26t01:43:18+00:00-a6397b88089a59b","title":"Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse","url":"https://www.dataengineeringpodcast.com/yellowbrick-distributed-cloud-data-warehouse-episode-190","content_text":"Summary\nThe data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing has for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nFirebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Mark Cusack about Yellowbrick, a data warehouse designed for distributed clouds\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Yellowbrick is and some of the story behind it?\nWhat does the term \"distributed cloud\" signify and what challenges are associated with it?\nHow would you characterize Yellowbrick’s position in the database/DWH market?\nHow is Yellowbrick architected?\n\nHow have the goals and design of the platform changed or evolved over time?\n\n\nHow does Yellowbrick maintain visibility across the different data locations that it is responsible for?\n\nWhat capabilities does it offer for being able to join across the disparate \"clouds\"?\n\n\nWhat are some data modeling strategies that users should consider when designing their deployment of Yellowbrick?\nWhat are some of the capabilities of Yellowbrick that you find most useful or technically interesting?\nFor someone who is adopting Yellowbrick, what is the process for getting it integrated into their data systems?\nWhat are the most underutilized, overlooked, or misunderstood features of Yellowbrick?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Yellowbrick used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Yellowbrick?\nWhen is Yellowbrick the wrong choice?\nWhat do you have planned for the future of the product?\n\nContact Info\n\nLinkedIn\n@markcusack on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nYellowbrick\nTeradata\nRainstor\nDistributed Cloud\nHybrid Cloud\nSwimOS\n\nPodcast Episode\n\n\nKafka\nPulsar\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nAWS Redshift\nMPP == Massively Parallel Processing\nPresto\nTrino\n\nPodcast Episode\n\n\nL3 Cache\nNVMe\nReactive Programming\nCoroutine\nStar Schema\nDenodo\nLexis Nexis\nVertica\nNetezza\nGrenplum\nPostgreSQL\n\nPodcast Episode\n\n\nClickhouse\n\nPodcast Episode\n\n\nErasure Coding\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing have for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Yellowbrick's CTO about the engineering behind their data warehouse engine that was built for speed and deployment across distributed clouds.","date_published":"2021-05-27T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c959cfa0-fe61-45a9-8e9c-34a2ae93cc55.mp3","mime_type":"audio/mpeg","size_in_bytes":41429456,"duration_in_seconds":3160}]},{"id":"podlove-2021-05-25t12:19:17+00:00-27b4f01e3e59b9f","title":"Easily Build Advanced Similarity Search With The Pinecone Vector Database","url":"https://www.dataengineeringpodcast.com/pinecone-vector-database-similarity-search-episode-189","content_text":"Summary\nMachine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWhen it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. 
High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.\nYour host is Tobias Macey and today I’m interviewing Edo Liberty about Pinecone, a vector database for powering machine learning and similarity search\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Pinecone is and the story behind it?\nWhat are some of the contexts where someone would want to perform a similarity search?\n\nWhat are the considerations that someone should be aware of when deciding between Pinecone and Solr/Lucene for a search oriented use case?\n\n\nWhat are some of the other use cases that Pinecone enables?\nIn the absence of Pinecone, what kinds of systems and solutions are people building to address those use cases?\nWhere does Pinecone sit in the lifecycle of data and how does it integrate with the broader data management ecosystem?\n\nWhat are some of the systems, tools, or frameworks that Pinecone might replace?\n\n\nHow is Pinecone implemented?\n\nHow has the architecture evolved since you first began working on it?\n\n\nWhat are the most complex or difficult aspects of building Pinecone?\nWho is your target user and how does that inform the user experience design and product development priorities?\nFor someone who wants to start using Pinecone, what is involved in populating it with data building an analysis or service with it?\nWhat are some of the data modeling considerations when building a set of vectors in Pinecone?\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen Pinecone used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Pinecone technology and business?\nWhen is Pinecone the wrong choice?\nWhat do you have planned for the future of Pinecone?\n\nContact Info\n\nWebsite\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nPinecone\nTheoretical Physics\nHigh Dimensional Geometry\nAWS Sagemaker\nVisual Cortex\nTemporal Lobe\nInverted Index\nElasticsearch\n\nPodcast Episode\n\n\nSolr\nLucene\nNMSLib\nJohnson-Lindenstrauss Lemma\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Edo Liberty about the Pinecone vector database and how it makes it easy to build a similarity search service.","date_published":"2021-05-25T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c5152046-80b3-4ff6-94a0-8a827a737c6d.mp3","mime_type":"audio/mpeg","size_in_bytes":27833244,"duration_in_seconds":2807}]},{"id":"podlove-2021-05-21t00:50:01+00:00-b3832a71400df63","title":"A Holistic Approach To Data Governance Through Self Reflection At Collibra","url":"https://www.dataengineeringpodcast.com/collibra-enterprise-data-governance-episode-188","content_text":"Summary\nData governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process. In this episode he shares his thoughts on the balance between human and technological processes that are necessary for a well-managed data governance strategy, how Collibra is designed to aid in that endeavor, and his experiences using the platform that his company is building to help power the company. This is an excellent conversation that spans the engineering and philosophical complexities of an important and ever-present aspect of working with data.\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\n\n\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\n\n\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. 
Sign up free at dataengineeringpodcast.com/rudder today.\n\n\nYour host is Tobias Macey and today I’m interviewing Stijn Christiaens about data governance in the enterprise and how Collibra applies the lessons learned from their customers to their own business\n\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Collibra and the story behind the company?\nWat does \"data governance\" mean to you, and how does that definition inform your work at Collibra?\n\nHow would you characterize the current landscape of \"data governance\" offerings and Collibra’s position within it?\n\n\nWhat are the elements of governance that are often ignored in small/medium businesses but which are essential for the enterprise? (e.g. data stewards, business glossaries, etc.)\nOne of the most important tasks as a data professional is to establish and maintain trust in the information you are curating. What are the biggest obstacles to overcome in that mission?\nWhat are some of the data problems that you will only find at large or complex organizations?\n\nHow does Collibra help to tame that complexity?\n\n\nWho are the end users of Collibra within an organization?\nCan you talk through the workflow and various interactions that your customers have as it relates to the overall flow of data through an organization?\nCan you describe how the Collibra platform is implemented?\n\nHow has the scope and design of the system evolved since you first began working on it?\n\n\nYou are currently leading a team that uses Collibra to manage the operations of the business. What are some of the most notable surprises that you have learned from being your own customer?\n\nWhat are some of the weak points that you have been able to identify and resolve?\nHow have you been able to use those lessons to help your customers?\n\n\nWhat are the activities that are resistant to automation?\n\nHow do you design the system to allow for a smooth handoff between mechanistic and humanistic processes?\n\n\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Collibra used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Collibra, and running the internal data office?\nWhen is Collibra the wrong choice?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\n@stichris on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCollibra\nCollibra Data Office\nElectrical Engineering\nResistor Color Codes\nSTAR Lab (semantics, technology, and research)\nMicrosoft Azure\nData Governance\nGDPR\nChief Data Officer\nDunbar’s Number\nBusiness Glossary\nData Steward\nERP == Enterprise Resource Planning\nCRM == Customer Relationship Management\nData Ownership\nData Mesh\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process. In this episode he shares his thoughts on the balance between human and technological processes that are necessary for a well-managed data governance strategy, how Collibra is designed to aid in that endeavor, and his experiences using the platform that his company is building to help power the company. This is an excellent conversation that spans the engineering and philosophical complexities of an important and ever-present aspect of working with data.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Stijn Christiaens about his experience building Collibra to address the complexities of data governance in the enterprise, and what he has learned from using his own product to run the business.","date_published":"2021-05-20T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c1827367-fe20-497b-acc5-f751645e080e.mp3","mime_type":"audio/mpeg","size_in_bytes":31288011,"duration_in_seconds":3352}]},{"id":"podlove-2021-05-18t14:37:15+00:00-e74eaf864a56ef5","title":"Unlocking The Power of Data Lineage In Your Platform with OpenLineage","url":"https://www.dataengineeringpodcast.com/openlineage-data-lineage-specification-episode-187","content_text":"Summary\nData lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in co-operating on open standards.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nWhen it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. 
With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.\nYour host is Tobias Macey and today I’m interviewing Julien Le Dem about Open Lineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what the Open Lineage project is and the story behind it?\nWhat is the current state of the ecosystem for generating and sharing metadata between systems?\nWhat are your goals for the OpenLineage effort?\nWhat are the biggest conceptual or consistency challenges that you are facing in defining a metadata model that is broad and flexible enough to be widely used while still being prescriptive enough to be useful?\nWhat is the current state of the project? (e.g. code available, maturity of the specification, etc.)\n\nWhat are some of the ideas or assumptions that you had at the beginning of this project that have had to be revisited as you iterate on the definition and implementation?\n\n\nWhat are some of the projects/organizations/etc. that have committed to supporting or adopting OpenLineage?\nWhat problem domain(s) are best suited to adopting OpenLineage?\nWhat are some of the problems or use cases that you are explicitly not including in scope for OpenLineage?\nFor someone who already has a lineage and/or metadata catalog, what is involved in evolving that system to work well with OpenLineage?\nWhat are some of the downstream/long-term impacts that you anticipate or hope that this standardization effort will generate?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while working on the OpenLineage effort?\nWhat do you have planned for the future of the project?\n\nContact Info\n\nLinkedIn\n@J_ on Twitter\njulienledem on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nOpenLineage\nMarquez\n\nPodcast Episode\n\n\nHadoop\nPig\nApache Parquet\n\nPodcast Episode\n\n\nDoug Cutting\nAvro\nApache Arrow\nService Oriented Architecture\nData Lineage\nApache Atlas\nDataHub\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nEgeria\nPandas\n\nPodcast.__init__ Episode\n\n\nApache Spark\nEXIF\nJSON Schema\nOpenTelemetry\n\nPodcast.__init__ Episode\n\n\nOpenTracing\nSuperset\n\nPodcast.__init__ Episode\nData Engineering Podcast Episode\n\n\nIceberg\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast Episode\n\n\ndbt\n\nPodcast Episode\n\n\nData Mesh\n\nPodcast Episode\n\n\nThe map is not the territory\nKafka\nApache Flink\nApache Storm\nKafka Streams\nStone Soup\nApache Beam\nLinux Foundation AI & Data\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in co-operating on open standards.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Julien Le Dem about the OpenLineage specification and the opportunity that it offers for simplifying the tracking and analysis of data lineage across your data platform.","date_published":"2021-05-18T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/60b79204-e1cf-4d92-b99d-eb350e8305c5.mp3","mime_type":"audio/mpeg","size_in_bytes":40642038,"duration_in_seconds":3458}]},{"id":"podlove-2021-05-14t02:31:31+00:00-873d87fad7cae03","title":"Building Your Data Warehouse On Top Of PostgreSQL","url":"https://www.dataengineeringpodcast.com/postgresql-data-warehouse-episode-186","content_text":"Summary\nThere is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production ready data warehouse with Postgres.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nFirebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Thomas Richter and Joshua Drake about using Postgres as your data warehouse\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by establishing a working definition of what constitutes a data warehouse for the purpose of this discussion?\n\nWhat are the limitations for out-of-the-box Postgres when trying to use it for these workloads?\n\n\nThere are a large and growing number of options for data warehouse style workloads. 
How would you categorize the different systems and what is PostgreSQL’s position in that ecosystem?\n\nWhat do you see as the motivating factors for a team or organization to select from among those categories?\n\n\nWhy would someone want to use Postgres as their data warehouse platform rather than using a purpose-built engine?\nWhat is the cost/performance equation for Postgres as compared to other data warehouse solutions?\nFor someone who wants to turn Postgres into a data warehouse engine, what are their options?\n\nWhat are the relative tradeoffs of the different open source and commercial offerings? (e.g. Citus, cstore_fdw, zedstore, Swarm64, Greenplum, etc.)\n\n\nOne of the biggest areas of growth right now is in the \"cloud data warehouse\" market where storage and compute are decoupled. What are the options for making that possible with Postgres? (e.g. using foreign data wrappers for interacting with data lake storage (S3, HDFS, Alluxio, etc.))\nWhat areas of work are happening in the Postgres community for upcoming releases to make it more easily suited to data warehouse/analytical workloads?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Postgres used in analytical contexts?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned from your own experiences of building analytical systems with Postgres?\nWhen is Postgres the wrong choice for a data warehouse?\nWhat are you most excited for/what are you keeping an eye on in upcoming releases of Postgres and its ecosystem?\n\nContact Info\n\nThomas\n\nLinkedIn\n\n\nJD\n\nLinkedIn\n@linuxhiker on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPostgreSQL\n\nPodcast Episode\n\n\nSwarm64\n\nPodcast Episode\n\n\nCommand Prompt Inc.\nIBM\nCognos\nOLAP Cube\nMariaDB\nMySQL\nPowell’s Books\nDBase\nPractical PostgreSQL\nNetezza\nPresto\nTrino\nApache Drill\nParquet\nParquet Foreign Data Wrapper\nSnowflake\n\nPodcast Episode\n\n\nAmazon RDS\nAmazon Aurora\nHyperscale\nCitus\nTimescaleDB\n\nPodcast Episode\nFollowup Podcast Episode\n\n\nGreenplum\nzedstore\nRedshift\nMicrosoft SQL Server\nPostgres Tablespaces\nDebezium\n\nPodcast Episode\n\n\nEDI == Enterprise Data Integration\nChange Data Capture\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production ready data warehouse with Postgres.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how you can build your data warehouse on top of PostgreSQL for flexibility and full control over your data.","date_published":"2021-05-13T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3152e927-75af-43e8-ac4e-2364334e1cb9.mp3","mime_type":"audio/mpeg","size_in_bytes":58302944,"duration_in_seconds":4506}]},{"id":"podlove-2021-05-11t01:29:49+00:00-b96d56da76433d2","title":"Making Analytical APIs Fast With Tinybird","url":"https://www.dataengineeringpodcast.com/tinybird-analytical-api-platform-episode-185","content_text":"Summary\nBuilding an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production ready API or data product. In this episode CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in Clickhouse as the core of their platform. This is a great conversation about the challenges of building a maintainable business from a technical and product perspective.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nAscend.io — recognized as a 2021 Gartner Cool Vendor in Enterprise AI Operationalization and Engineering—empowers data teams to to build, scale, and operate declarative data pipelines with 95% less code and zero maintenance. Connect to any data source using Ascend’s new flex code data connectors, rapidly iterate on transformations and send data to any destination in a fraction of the time it traditionally takes—just ask companies like Harry’s, HNI, and Mayvenn. Sound exciting? Come join the team! 
We’re hiring data engineers, so head on over to dataengineeringpodcast.com/ascend and check out our careers page to learn more.\nYour host is Tobias Macey and today I’m interviewing Jorge Sancha about Tinybird, a platform to easily build analytical APIs for real-time data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Tinybird and the story behind it?\nWhat are some of the types of use cases that your customers are focused on?\nWhat are the areas of complexity that come up when building analytical APIs that are often overlooked when first designing a system to operate on and expose real-time data?\n\nWhat are the supporting systems that are necessary and useful for operating this kind of system which contribute to the overall time and engineering cost beyond the baseline functionality?\n\n\nHow is the Tinybird platform architected?\n\nHow have the goals and implementation of Tinybird changed or evolved since you first began building it?\n\n\nWhat was your criteria for selecting the core building block of your platform, and how did that lead to your choice to build on top of Clickhouse?\nWhat are some of the sharp edges that you have run into while operating Clickhouse?\n\nWhat are some of the custom tools or systems that you have built to help deal with them?\n\n\nWhat are some of the performance challenges that an API built with Tinybird might run into?\n\nWhat are the considerations that users should be aware of to avoid introducing performance issues?\n\n\nHow do you handle multi-tenancy in your platform? (e.g. separate clusters, in-database quotas, etc.)\nFor users of Tinybird, can you talk through the workflow of getting it integrated into their platform and designing an API from their data?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Tinybird used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Tinybird?\nWhen is Tinybird the wrong choice?\nWhat do you have planned for the future of the product and business?\n\nContact Info\n\n@jorgesancha on Twitter\nLinkedIn\njorgesancha on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTinybird\nCarto\nPostgreSQL\n\nPodcast Episode\n\n\nPostGIS\nClickhouse\n\nPodcast Episode\n\n\nKafka\nTornado\n\nPodcast.__init__ Episode\n\n\nRedis\nFormula 1\nWeb Application Firewall\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production ready API or data product. In this episode CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in Clickhouse as the core of their platform. This is a great conversation about the challenges of building a maintainable business from a technical and product perspective.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A conversation about how Tinybird invested in Clickhouse to power analytical APIs that are fast to build and operate.","date_published":"2021-05-10T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bd27c25c-71d1-4caa-9d4a-003318c3ac70.mp3","mime_type":"audio/mpeg","size_in_bytes":38099766,"duration_in_seconds":3263}]},{"id":"podlove-2021-05-07t02:02:14+00:00-df690527da2c7fc","title":"Making Spark Cloud Native At Data Mechanics","url":"https://www.dataengineeringpodcast.com/data-mechanics-cloud-native-spark-episode-184","content_text":"Summary\nSpark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development and experimentation cycle, and how you can get a head start using their pre-built Spark container. This is a great conversation for understanding how new ways of operating systems can have broader impacts on how they are being used.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nFirebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Jean-Yves Stephan about Data Mechanics, a cloud-native Spark platform for data engineers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Data Mechanics and the story behind it?\nWhat are the operational characteristics of Spark that make it difficult to run in a cloud-optimized environment?\nHow do you handle retries, state redistribution, etc. 
when instances get pre-empted during the middle of a job execution?\n\nWhat are some of the tactics that you have found useful when designing jobs to make them more resilient to interruptions?\n\n\nWhat are the customizations that you have had to make to Spark itself?\nWhat are some of the supporting tools that you have built to allow for running Spark in a Kubernetes environment?\nHow is the Data Mechanics platform implemented?\n\nHow have the goals and design of the platform changed or evolved since you first began working on it?\n\n\nHow does running Spark in a container/Kubernetes environment change the ways that you and your customers think about how and where to use it?\n\nHow does it impact the development workflow for data engineers and data scientists?\n\n\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while building the Data Mechanics product?\nWhen is Spark/Data Mechanics the wrong choice?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nData Mechanics\nDatabricks\nStanford\nAndrew Ng\nMining Massive Datasets\nSpark\nKubernetes\nSpot Instances\nInfiniband\nData Mechanics Spark Container Image\nDelight – Spark monitoring utility\nTerraform\nBlue/Green Deployment\nSpark Operator for Kubernetes\nJupyterHub\nJupyter Enterprise Gateway\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development and experimentation cycle, and how you can get a head start using their pre-built Spark container. This is a great conversation for understanding how new ways of operating systems can have broader impacts on how they are being used.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A conversation about how the team at Data Mechanics is bringing Apache Spark into the cloud native world and the positive impact that has on your development experience.","date_published":"2021-05-06T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/56c3347e-f23c-4390-bb18-c66fe08312c7.mp3","mime_type":"audio/mpeg","size_in_bytes":33946400,"duration_in_seconds":2415}]},{"id":"podlove-2021-05-04t01:11:47+00:00-3eb088321fa3171","title":"The Grand Vision And Present Reality of DataOps","url":"https://www.dataengineeringpodcast.com/dataops-round-table-episode-313","content_text":"Summary\nThe Data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. 
It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Max Beauchemin, Lior Gavish, and Kevin Stumpf about the real world challenges of embracing DataOps practices and systems, and how to keep things secure as you scale\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nBefore we get started, can you each give your definition of what \"DataOps\" means to you?\n\nHow does this differ from \"business as usual\" in the data industry?\nWhat are some of the things that DataOps isn’t (despite what marketers might say)?\n\n\nWhat are the biggest difficulties that you have faced in going from concept to production with a workflow or system intended to power self-serve access to other members of the organization?\nWhat are the weak points in the current state of the industry, whether technological or social, that contribute to your greatest sense of unease from a security perspective?\nAs founders of companies that aim to facilitate adoption of various aspects of DataOps, how are you applying the products that you are building to your own internal systems?\nHow does security factor into the design of robust DataOps systems? What are some of the biggest challenges related to security when it comes to putting these systems into production?\nWhat are the biggest differences between DevOps and DataOps, particularly when it concerns designing distributed systems?\nWhat areas of the DataOps landscape do you think are ripe for innovation?\nNowadays, it seems like new DataOps companies are cropping up every day to try and solve some of these problems. Why do you think DataOps is becoming such an important component of the modern data stack?\nThere’s been a lot of conversation recently around the \"rise of the data engineer\" versus other roles in the data ecosystem (i.e. data scientist or data analyst). Why do you think that is?\nWhat are some of the most valuable lessons that you have learned from working with your customers about how to apply DataOps principles?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while building your respective platforms and businesses?\nWhat are the industry trends that you are each keeping an eye on to inform you future product direction?\n\nContact Info\n\nKevin\n\nLinkedIn\nkevinstumpf on GitHub\n@kevinstumpf on Twitter\n\n\nMaxime\n\nLinkedIn\n@mistercrunch on Twitter\nmistercrunch on GitHub\n\n\nLior\n\nLinkedIn\n@lgavish on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nTecton\nMonte Carlo\nSuperset\nPreset\nBarracuda Networks\nFeature Store\nDataOps\nDevOps\nData Catalog\nAmundsen\nOpenLineage\nThe Downfall of the Data Engineer\nHashicorp Vault\nReverse ELT\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The Data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A conversation about the grand vision and current realities of DataOps and how you can start on the journey toward more maintainable and reliable data systems.","date_published":"2021-05-03T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/03a6ee0a-f03c-4aba-80e9-388b8b0ed67f.mp3","mime_type":"audio/mpeg","size_in_bytes":46119617,"duration_in_seconds":3428}]},{"id":"podlove-2021-04-25t21:51:38+00:00-d2f61aed3176aaf","title":"Self Service Data Exploration And Dashboarding With Superset","url":"https://www.dataengineeringpodcast.com/superset-data-exploration-episode-182","content_text":"\n\nSummary\nThe reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. 
Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration, dashboards, and business intelligence\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Superset is?\nSuperset is becoming part of the reference architecture for a modern data stack. What are the factors that have contributed to its popularity over other tools such as Redash, Metabase, Looker, etc.?\nWhere do dashboarding and exploration tools like Superset fit in the responsibilities and workflow of a data engineer?\nWhat are some of the challenges that Superset faces in being performant when working with large data sources?\n\nWhich data sources have you found to be the most challenging to work with?\n\n\nWhat are some anti-patterns that users of Superset might run into when building out a dashboard?\nWhat are some of the ways that users can surface data quality indicators (e.g. freshness, lineage, check results, etc.) in a Superset dashboard?\nAnother trend in analytics and dashboard tools is providing actionable insights. How can Superset support those use cases where a business user or analyst wants to perform an action based on the data that they are being shown?\nHow can Superset factor into a data governance strategy for the business?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Superset used?\ndogfooding\nWhat are the most interesting, unexpected, or challenging lessons that you have learned from working on Superset and founding Preset?\nWhen is Superset the wrong choice?\nWhat do you have planned for the future of Superset and Preset?\n\nContact Info\n\nLinkedIn\n@mistercrunch on Twitter\nmistercrunch on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSuperset\n\nPodcast.__init__ Episode\n\n\nPreset\nASP (Active Server Pages)\nVBScript\nData Warehouse Institute\nRalph Kimball\nBill Inmon\nUbisoft\nHadoop\nTableau\nLooker\n\nPodcast Episode\n\n\nThe Future of Business Intelligence Is Open Source\nSupercharging Apache Superset\nRedash\n\nPodcast.__init__ Episode\n\n\nMetabase\n\nPodcast Episode\n\n\nThe Rise Of The Data Engineer\nAirBnB Data University\nPython DBAPI\nSQLAlchemy\nDruid\nSQL Common Table Expressions\nSQL Window Functions\nData Warehouse Semantic Layer\nAmundsen\n\nPodcast Episode\n\n\nOpen Lineage\nDatakin\nMarquez\n\nPodcast Episode\n\n\nApache Arrow\n\nPodcast.__init__ Episode with Wes McKinney\n\n\nApache Parquet\nDataHub\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Maxime Beauchemin about how to use Apache Superset as a platform for self-service data exploration and analytics.","date_published":"2021-04-26T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/db2aa1d1-4d02-41d6-aa16-68b0c629abe3.mp3","mime_type":"audio/mpeg","size_in_bytes":37128361,"duration_in_seconds":2844}]},{"id":"podlove-2021-04-19t02:28:14+00:00-1f3fa889097e7d2","title":"Moving Machine Learning Into The Data Pipeline at Cherre","url":"https://www.dataengineeringpodcast.com/cherre-address-data-pipeline-episode-181","content_text":"Summary\nMost of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. 
It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nStarted as physicist and evolved into Data Science\nCan you start by giving a brief recap of what Cherre is and the types of data that you deal with?\nCherre is a company that connects data\nWe’re not a data vendor, in that we don’t sell data, primarily\nWe help companies connect and make sense of their data\nThe real estate market is historically closed, gut let, behind on tech\nWhat are the biggest challenges that you deal with in your role when working with real estate data?\nLack of a standard domain model in real estate.\nOntology. What is a property? Each data source, thinks about properties in a very different way. Therefore, yielding similar, but completely different data.\nQUALITY (Even if the dataset are talking about the same thing, there are different levels of accuracy, freshness).\nHIREARCHY. When is one source better than another\nWhat are the teams and systems that rely on address information?\nAny company that needs to clean or organize (make sense) their data, need to identify, people, companies, and properties.\nOur clients use Address resolution in multiple ways. Via the UI or via an API. Our service is both external and internal so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it.\nCan you give an example for the problems involved in entity resolution\nKnown entity example.\nEmpire state buidling.\nTo resolve addresses in a way that makes sense for the client you need to capture the real world entities. Lots, buildings, units.\n\nIdentify the type of the object (lot, building, unit)\nTag the object with all the relevant addresses\nRelations to other objects (lot, building, unit)\n\n\nWhat are some examples of the kinds of edge cases or messiness that you encounter in addresses?\nFirst class is string problems.\nSecond class component problems.\nthird class is geocoding.\nI understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved?\nWhat is the need for the service. 
The main requirement here is connecting an address to lot, building, unit with latitude and longitude coordinates\n\nHow were you satisfying this requirement previously?\nBefore we built our model and dedicated service we had a basic prototype for pipeline only to handle NYC addresses.\nWhat were the motivations for designing and implementing this as a service?\nNeed to expand nationwide and to deal with client queries in real time.\nWhat are some of the other data sources that you rely on to be able to perform this normalization and resolution?\nLot data, building data, unit data, Footprints and address points datasets.\nWhat challenges do you face in managing these other sources of information?\nAccuracy, hirearchy, standardization, unified solution, persistant ids and primary keys\n\n\nDigging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it?\nString cleaning, Parse and tokenize, standardize, Match\nWhat are some of the other pieces of information in your system that you would like to see addressed in a similar fashion?\nOur named entity solution with connection to knowledge graph and owner unmasking.\nWhat are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system?\nScaling nyc geocode example. The NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure.\nNow that you have this system running in production, if you were to start over today what would you do differently?\na lot but at this point the module boundaries and client interface are defined in such way that we are able to make changes or completely replace any given part of it without breaking anything client facing\nWhat are some of the other projects that you are excited to work on going forward?\nNamed entity resolution and Knowledge Graph\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\nBigQuery is huge asset and in particular UDFs but they don’t support API calls or python script\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nCherre\n\nPodcast Episode\n\n\nPhotonics\nKnowledge Graph\nEntity Resolution\nBigQuery\nNLP == Natural Language Processing\ndbt\n\nPodcast Episode\n\n\nAirflow\n\nPodcast.__init__ Episode\n\n\nDatadog\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the team at Cherre built an internal machine learning project to use as a service in their data pipelines to make dealing with messy address data less painful.","date_published":"2021-04-19T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/19928da7-3213-484d-9882-4a678d9dc66b.mp3","mime_type":"audio/mpeg","size_in_bytes":37928325,"duration_in_seconds":2884}]},{"id":"podlove-2021-04-12t21:15:17+00:00-6139d14b7bc4d5f","title":"Exploring The Expanding Landscape Of Data Professions with Josh Benamram of Databand","url":"https://www.dataengineeringpodcast.com/databand-data-professions-episode-180","content_text":"Summary\n\"Business as usual\" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. 
With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by discussing the set of roles that you see in a majority of data teams?\nWhat new roles do you see emerging, and what are the motivating factors?\n\nWhich of the more established positions are fracturing or merging to create these new responsibilities?\n\n\nWhat are the contexts in which you are seeing these role definitions used? (e.g. small teams, large orgs, etc.)\nHow do the increased granularity/specialization of responsibilities across data teams change the ways that data and platform architects need to think about technology investment?\n\nWhat are the organizational impacts of these new types of data work?\n\n\nHow do these shifts in role definition change the ways that the individuals in the position interact with the data platform?\n\nWhat are the types of questions that practitioners in different roles are asking of the data that they are working with? (e.g. what is the lineage of this asset vs. what is the distribution of values in this column, etc.)\n\n\nHow can metrics and observability data about pipelines and data systems help to support these various roles?\nWhat are the different ways of measuring data quality for the needs of these roles?\nHow is the work you are doing at Databand informed by these changing needs?\nOne of the big challenges caused by data systems is the varying modes of access and interaction across the different stakeholders and activities. How can data platform teams and vendors help to surface useful metrics and information across these various interfaces without forcing users into a new or unfamiliar workflow?\nWhat are some of the long-term impacts that you foresee in the data ecosystem and ways of interacting with data as a result of the current trend toward more specialized tasks?\nAs a vendor working to provide useful context to these practitioners what are some of the most interesting, unexpected, or challenging lessons that you have learned?\nWhat do you have planned for the future of Databand?\n\nContact Info\n\nEmail\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDataband\n\nWebsite\nPlatform\nOpen Core\nMore data engineering stories & best practices\n\n\nAtlassian\nChartio\nData Mesh Article\n\nPodcast Episode\n\n\nGrafana\nMetabase\nSuperset\n\nPodcast.__init__ Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nSpark\nAirflow\n\nPodcast.__init__ Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Josh Benamram about the emerging roles across the data ecosystem and how they interact with data systems.","date_published":"2021-04-12T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d425045f-95db-4b24-8402-bfd7028d39c2.mp3","mime_type":"audio/mpeg","size_in_bytes":49972428,"duration_in_seconds":4116}]},{"id":"podlove-2021-04-05t00:10:17+00:00-73c63c468d2ec1c","title":"Put Your Whole Data Team On The Same Page With Atlan","url":"https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179","content_text":"Summary\nOne of the biggest obstacles to success in delivering data products is cross-team collaboration. Part of the problem is the difference in the information that each role requires to do their job and where they expect to find it. This introduces a barrier to communication that is difficult to overcome, particularly in teams that have not reached a significant level of maturity in their data journey. In this episode Prukalpa Sankar shares her experiences across multiple attempts at building a system that brings everyone onto the same page, ultimately bringing her to found Atlan. She explains how the design of the platform is informed by the needs of managing data projects for large and small teams across her previous roles, how it integrates with your existing systems, and how it can work to bring everyone onto the same page.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. 
Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Prukalpa Sankar about Atlan, a modern data workspace that makes collaboration among data stakeholders easier, increasing efficiency and agility in data projects\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Atlan and some of the story behind it?\nWho are the target users of Atlan?\nWhat portions of the data workflow is Atlan responsible for?\n\nWhat components of the data stack might Atlan replace?\n\n\nHow would you characterize Atlan’s position in the current data ecosystem?\n\nWhat makes Atlan stand out from other systems for data cataloguing, metadata management, or data governance?\nWhat types of data assets (e.g. structured vs unstructured, textual vs binary, etc.) is Atlan designed to understand?\n\n\nCan you talk through how Atlan is implemented?\n\nHow have the goals and design of the platform changed or evolved since you first began working on it?\nWhat are some of the early assumptions that you have had to revisit or reconsider?\n\n\nWhat is involved in getting Atlan deployed and integrated into an existing data platform?\nBeyond the technical aspects, what are the business processes that teams need to implement to be successful when incorporating Atlan into their systems?\nOnce Atlan is set up, what is a typical workflow for an individual and their team to collaborate on a set of data assets, or building out a new processing pipeline?\n\nWhat are some useful steps for introducing all of the stakeholders to the system and workflow?\n\n\nWhat are the available extension points for managing data in systems that aren’t supported by Atlan out of the box?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Atlan used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building Atlan?\nWhen is Atlan the wrong choice?\nWhat do you have planned for the future of the product?\n\nContact Info\n\nLinkedIn\n@prukalpa on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nAtlan\nIndia’s National Data Platform\nWorld Economic Forum\nUN\nGates Foundation\nGitHub\nFigma\nSnowflake\nRedshift\nDatabricks\nDBT\nSisense\nLooker\nApache Atlas\nImmuta\nDataHub\nDatakin\nApache Ranger\nGreat Expectations\nTrino\nAirflow\nDagster\nPrivacera\nDataband\nCloudformation\nGrafana\nDeequ\nWe Failed to Set Up a Data Catalog 3x. Here’s Why.\nAnalysing the analysers book\nOpenAPI\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"In this episode Prukalpa Sankar discusses how Atlan uses metadata from all of your workflows to bring everyone on the same page, letting you delivery on your data projects in record time.","date_published":"2021-04-05T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fda3ccc5-b381-4fc6-846c-a160e11e2151.mp3","mime_type":"audio/mpeg","size_in_bytes":43648298,"duration_in_seconds":3456}]},{"id":"podlove-2021-03-30t02:56:00+00:00-b8efc417155bfb3","title":"Data Quality Management For The Whole Team With Soda Data","url":"https://www.dataengineeringpodcast.com/soda-data-quality-management-episode-178","content_text":"Summary\nData quality is on the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable. They explain how they started down the path of building a solution for managing data quality, their philosophy of how to empower data engineers with well engineered open source tools that integrate with the rest of the platform, and how to bring all of the stakeholders onto the same page to make your data great. There are many aspects of data quality management and it’s always a treat to learn from people who are dedicating their time and energy to solving it for everyone.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. 
It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Maarten Masschelein and Tom Baeyens about the work are doing at Soda to power data quality management\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Soda?\nWhat problem are you trying to solve?\nAnd how are you solving that problem?\n\nWhat motivated you to start a business focused on data monitoring and data quality?\n\n\nThe data monitoring and broader data quality space is a segment of the industry that is seeing a huge increase in attention recently. Can you share your perspective on the current state of the ecosystem and how your approach compares to other tools and products?\nwho have you created Soda for (e.g platform engineers, data engineers, data product owners etc) and what is a typical workflow for each of them?\nHow do you go about integrating Soda into your data infrastructure?\nHow has the Soda platform been architected?\nWhy is this architecture important?\n\nHow have the goals and design of the system changed or evolved as you worked with early customers and iterated toward your current state?\n\n\nWhat are some of the challenges associated with the ongoing monitoring and testing of data?\nwhat are some of the tools or techniques for data testing used in conjunction with Soda?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Soda being used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building the technology and business for Soda?\nWhen is Soda the wrong choice?\nWhat do you have planned for the future?\n\nContact Info\n\nMaarten\n\nLinkedIn\n@masscheleinm on Twitter\n\n\nTom\n\nLinkedIn\n@tombaeyens on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSoda Data\nSoda SQL\nRedHat\nCollibra\nSpark\nGetting Things Done by David Allen (affiliate link)\nSlack\nOpsGenie\nDBT\nAirflow\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the Soda Data platform and the open source components that they are building to level up the quality of your data pipelines.","date_published":"2021-03-29T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2064c6de-eb10-41b1-9d28-a034d55fb1e2.mp3","mime_type":"audio/mpeg","size_in_bytes":49604750,"duration_in_seconds":3480}]},{"id":"podlove-2021-03-23t00:08:45+00:00-d191ced49ea329c","title":"Real World Change Data Capture At Datacoral","url":"https://www.dataengineeringpodcast.com/datacoral-change-data-capture-episode-177","content_text":"Summary\nThe world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. 
Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Raghu Murthy about his recent work of making change data capture more accessible and maintainable\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what CDC is and when it is useful?\nWhat are the alternatives to CDC?\n\nWhat are the cases where a more batch-oriented approach would be preferable?\n\n\nWhat are the factors that you need to consider when deciding whether to implement a CDC system for a given data integration?\n\nWhat are the barriers to entry?\n\n\nWhat are some of the common mistakes or misconceptions about CDC that you have encountered in your own work and while working with customers?\nHow does CDC fit into a broader data platform, particularly where there are likely to be other data integration pipelines in operation? (e.g. Fivetran/Airbyte/Meltano/custom scripts)\nWhat are the moving pieces in a CDC workflow that need to be considered as you are designing the system?\n\nWhat are some examples of the configuration changes necessary in source systems to provide the needed log data?\n\n\nHow would you characterize the current landscape of tools available off the shelf for building a CDC pipeline?\n\nWhat are your predictions about the potential for a unified abstraction layer for log-based CDC across databases?\n\n\nWhat are some of the potential performance/uptime impacts on source databases, both during the initial historical sync and once you hit a steady state?\n\nHow can you mitigate the impacts of the CDC pipeline on the source databases?\n\n\nWhat are some of the implementation details that application developers DBAs need to be aware of for data modeling in the source systems to allow for proper replication via CDC?\nAre there any performance challenges that need to be addressed in the consumers or destination systems? e.g. parallelism\nCan you describe the technical implementation and architecture that you use for implementing CDC?\n\nHow has the design evolved as you have grown the scale and sophistication of your system?\n\n\nIn the destination system, what data modeling decisions need to be made to ensure that the replicated information is usable for anlytics?\n\nWhat additional attributes need to be added to track things like row modifications, deletions, schema changes, etc.?\nHow do you approach treatment of data copies in the DWH? (e.g. ELT – keep all source tables and use DBT for converting relevant tables into star/snowflake/data vault/wide tables)\n\n\nWhat are your thoughts on the viability of a data lake as the destination system? (e.g. S3/Parquet or Trino/Drill/etc.)\nCDC is a topic that is generally reserved for coversations about databases, but what are some of the other systems that we could think about implementing CDC? e.g. 
APIs and third party data sources\nHow can we integrate CDC into metadata/lineage tooling?\nHow do you handle observability of CDC flows?\n\nWhat is involved in debugging a replication flow?\n\n\nHow can we build data quality checks into CDC workflows?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen CDC used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned from digging deep into CDC implementation?\nWhen is CDC the wrong choice?\nWhat are some of the industry or technology trends around CDC that you are most excited by?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDataCoral\n\nPodcast Episode\n\n\nDataCoral Blog\n\n3 Steps To Build A Modern Data Stack\nChange Data Capture: Overview\n\n\nHive\nHadoop\nDBT\n\nPodcast Episode\n\n\nFiveTran\n\nPodcast Episode\n\n\nChange Data Capture\nMetadata First Blog Post\nDebezium\n\nPodcast Episode\n\n\nUUID == Universally Unique Identifier\nAirflow\nOracle Goldengate\nParquet\nTrino\nAWS Lambda\nData Mesh\n\nPodcast Episode\n\n\nEnterprise Message Bus\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Raghu Murthy about the reality of running and maintaining change data capture pipelines for customers at Datacoral.","date_published":"2021-03-22T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/20de95e3-2de8-4f10-9e23-0e2b2d2cab8f.mp3","mime_type":"audio/mpeg","size_in_bytes":39110657,"duration_in_seconds":2998}]},{"id":"podlove-2021-03-15t23:41:34+00:00-249a4cbee67e41e","title":"Managing The DoorDash Data Platform","url":"https://www.dataengineeringpodcast.com/doordash-data-platform-episode-176","content_text":"Summary\nThe team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform with that supports a variety of teams with varied and evolving needs.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. 
Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving a quick overview of what you do at DoorDash?\n\nWhat are some of the ways that data is used to power the business?\n\n\nHow has the pandemic affected the scale and volatility of the data that you are working with?\nCan you describe the type(s) of data that you are working with?\n\nWhat are the primary sources of data that you collect?\n\nWhat secondary or third party sources of information do you rely on?\n\n\nCan you give an overview of the collection process for that data?\n\n\nIn selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating the build vs. buy decision?\nIn your recent post about how you are scaling the capabilities and capacity of your data platform you mentioned the concept of maintaining a \"paved path\" of supported technologies to simplify integration across teams. What are the technologies that you use and rely on for the \"paved path\"?\nHow are you managing quality and consistency of your data across its lifecycle?\n\nWhat are some of the specific data quality solutions that you have integrated into the platform and \"paved path\"?\n\n\nWhat are some of the technologies that were used early on at DoorDash that failed to keep up as the business scaled?\n\nHow do you manage the migration path for adopting new technologies or techniques?\n\n\nIn the same post you mentioned the tendency to allow for building point solutions before deciding whether to generalize a given use case into a generalized platform capability. Can you give some examples of cases where a point solution remains a one-off versus when it needs to be expanded into a widely used component?\nHow do you identify and tracking cost factors in the data platform?\n\nWhat do you do with that information?\n\n\nWhat is your approach for identifying and measuring useful OKRs (Objectives and Key Results)?\n\nHow do you quantify potentially subjective metrics such as reliability and quality?\n\n\nHow have you designed the organizational structure for your data teams?\n\nWhat are the responsibilities and organizational interfaces for data engineers within the company?\nHow have the organizational structures/patterns shifted or changed at different levels of scale/maturity for the business?\n\n\nWhat are some of the most interesting, useful, unexpected, or challenging lessons that you have learned during your time as a data professional at DoorDash?\nWhat are some of the upcoming projects or changes that you anticipate in the near to medium future?\n\nContact Info\n\nLinkedIn\n@stonse on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nHow DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand\nDoorDash\nUber\nNetscape\nNetflix\nChange Data Capture\nDebezium\n\nPodcast Episode\n\n\nSnowflakeDB\n\nPodcast Episode\n\n\nAirflow\n\nPodcast.__init__ Episode\n\n\nKafka\nFlink\n\nPodcast Episode\n\n\nPinot\nGDPR\nCCPA\nData Governance\nAWS\nLightGBM\nXGBoost\nBig Data Landscape\nKinesis\nKafka Connect\nCassandra\nPostgreSQL\n\nPodcast Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nSQS\nFeature Toggles\nBigEye\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

","summary":"An interview with Sudhir Tonse about his work at DoorDash to help build a data platform that scales to meet the self service needs of analysts and data scientists.","date_published":"2021-03-15T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2711d89b-d8f9-4fb9-8c7c-d84eb94e334a.mp3","mime_type":"audio/mpeg","size_in_bytes":29160445,"duration_in_seconds":2764}]},{"id":"podlove-2021-03-09t00:07:49+00:00-fe4f0ca30aa12fb","title":"Leave Your Data Where It Is And Automate Feature Extraction With Molecula","url":"https://www.dataengineeringpodcast.com/molecula-feature-store-episode-175","content_text":"Summary\nA majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way. After tasking some of his top engineers to consider the problem in a new light they created the Pilosa engine. In this episode H.O. explains how using Pilosa as the core he built the Molecula platform to eliminate the need to copy data between systems in able to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. 
Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing H.O. Maycotte about Molecula, a cloud based feature store based on the open source Pilosa project\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Molecula and the story behind it?\n\nWhat are the additional capabilities that Molecula offers on top of the open source Pilosa project?\n\n\nWhat are the problems/use cases that Molecula solves for?\nWhat are some of the technologies or architectural patterns that Molecula might replace in a companies data platform?\nOne of the use cases that is mentioned on the Molecula site is as a feature store for ML and AI. This is a category that has been seeing a lot of growth recently. Can you provide some context how Molecula fits in that market and how it compares to options such as Tecton, Iguazio, Feast, etc.?\n\nWhat are the benefits of using a bitmap index for identifying and computing features?\n\n\nCan you describe how the Molecula platform is architected?\n\nHow has the design and goal of Molecula changed or evolved since you first began working on it?\n\n\nFor someone who is using Molecula, can you describe the process of integrating it with their existing data sources?\nCan you describe the internal data model of Pilosa/Molecula?\n\nHow should users think about data modeling and architecture as they are loading information into the platform?\n\n\nOnce a user has data in Pilosa, what are the available mechanisms for performing analyses or feature engineering?\nWhat are some of the most underutilized or misunderstood capabilities of Molecula?\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen the Molecula platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned from building and scaling Molecula?\nWhen is Molecula the wrong choice?\nWhat do you have planned for the future of the platform and business?\n\nContact Info\n\nLinkedIn\n@maycotte on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMolecula\nPilosa\n\nPodcast Episode\n\n\nThe Social Dilemma\nFeature Store\nCassandra\nElasticsearch\n\nPodcast Episode\n\n\nDruid\nMongoDB\nSwimOS\n\nPodcast Episode\n\n\nKafka\nKafka Schema Registry\n\nPodcast Episode\n\n\nHomomorphic Encryption\nLucene\nSolr\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with H.O. Maycotte about how the Molecula platform and the underlying Pilosa engine allows you to automatically do feature extraction from your data without having to centralize it.","date_published":"2021-03-08T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/62528a19-cd43-4ecf-bb44-08c682b8ebac.mp3","mime_type":"audio/mpeg","size_in_bytes":34775623,"duration_in_seconds":3099}]},{"id":"podlove-2021-03-02t01:37:54+00:00-c33c37fc42f31a5","title":"Bridging The Gap Between Machine Learning And Operations At Iguazio","url":"https://www.dataengineeringpodcast.com/iguazio-machine-learning-operations-episode-174","content_text":"Summary\nThe process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. 
Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data science & analytics?\nCan you start by giving an overview of what Iguazio is and the story of how it got started?\nHow would you characterize your target or typical customer?\nWhat are the biggest challenges that you see around building production grade workflows for machine learning?\n\nHow does Iguazio help to address those complexities?\n\n\nFor customers who have already invested in the technical and organizational capacity for data science and data engineering, how does Iguazio integrate with their environments?\nWhat are the responsibilities of a data engineer throughout the different stages of the lifecycle for a machine learning application?\nCan you describe how the Iguazio platform is architected?\n\nHow has the design of the platform evolved since you first began working on it?\nHow have the industry best practices around bringing machine learning to production changed?\n\n\nHow do you approach testing/validation of machine learning applications and releasing them to production environments? (e.g. CI/CD)\nOnce a model is in production, what are the types and sources of information that you collect to monitor their performance?\n\nWhat are the factors that contribute to model drift?\n\n\nWhat are the remaining gaps in the tooling or processes available for managing the lifecycle of machine learning projects?\nWhat are the most interesting, innovative, or unexpected ways that you have seen the Iguazio platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building and scaling the Iguazio platform and business?\nWhen is Iguazio the wrong choice?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\n@yaronhaviv on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nIguazio\nMLOps\nOracle Exadata\nSAP HANA\nMellanox\nNVIDIA\nMulti-Model Database\nNuclio\nMLRun\nJupyter Notebook\nPandas\nScala\nFeature Imputing\nFeature Store\nParquet\nSpark\nApache Flink\n\nPodcast Episode\n\n\nApache Beam\nNLP (Natural Language Processing)\nDeep Learning\nBERT\nAirflow\n\nPodcast.__init__ Episode\n\n\nDagster\n\nData Engineering Podcast Episode\nPodcast.__init__ Episode\n\n\nKubeflow\nArgo\nAWS Step Functions\nPresto/Trino\n\nPodcast Episode\n\n\nDask\n\nPodcast Episode\n\n\nHadoop\nSagemaker\nTecton\n\nPodcast Episode\n\n\nSeldon\nDataRobot\nRapidMiner\nH2O.ai\nGrafana\nStorey\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the Iguazio platform reduces the friction in bringing your machine learning workloads to production in a fast and maintainable way.","date_published":"2021-03-01T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/10bd33e8-278c-4cc0-a11e-cccc25feb1bc.mp3","mime_type":"audio/mpeg","size_in_bytes":39847329,"duration_in_seconds":3987}]},{"id":"podlove-2021-02-20t12:44:24+00:00-fd704ce002ce8d2","title":"Self Service Open Source Data Integration With AirByte","url":"https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173","content_text":"Summary\nData integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrasingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open soure data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. 
It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Airbyte is and the story behind it?\nBusinesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space?\nHow would you characterize your target users?\n\nHow have those personas instructed the priorities and design of Airbyte?\nWhat do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?\n\n\nwhat are the complex/challenging elements of data integration that makes it such a slippery problem?\nmotivation for creating open source ELT as a business\nCan you describe how the Airbyte platform is implemented?\n\nWhat was your motivation for choosing Java as the primary language?\n\n\nincidental complexity of forcing all connectors to be packaged as containers\nshortcomings of the Singer specification/motivation for creating a backwards incompatible interface\nperceived potential for community adoption of Airbyte specification\ntradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.\n\ninformation lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)\n\n\ninterfaces/extension points for integrating with other tools, e.g. 
Dagster\nabstraction layers for simplifying implementation of new connectors\ntradeoffs of storing all connectors in a monorepo with the Airbyte core\n\nimpact of community adoption/contributions\n\n\nWhat is involved in setting up an Airbyte installation?\nWhat are the available axes for scaling an Airbyte deployment?\nchallenges of setting up and maintaining CI environment for Airbyte\nHow are you managing governance and long term sustainability of the project?\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?\nWhen is Airbyte the wrong choice?\nWhat do you have planned for the future of the project?\n\nContact Info\n\nMichel\n\nLinkedIn\n@MichelTricot on Twitter\nmichel-tricot on GitHub\n\n\nJohn\n\nLinkedIn\n@JeanLafleur on Twitter\njohnlafleur on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAirbyte\nLiveramp\nFivetran\n\nPodcast Episode\n\n\nStitch Data\nMatillion\nDataCoral\n\nPodcast Episode\n\n\nSinger\nMeltano\n\nPodcast Episode\n\n\nAirflow\n\nPodcast.__init__ Episode\n\n\nKotlin\nDocker\nMonorepo\nAirbyte Specification\nGreat Expectations\n\nPodcast Episode\n\n\nDagster\n\nData Engineering Podcast Episode\nPodcast.__init__ Episode\n\n\nPrefect\n\nPodcast Episode\n\n\nDBT\n\nPodcast Episode\n\n\nKubernetes\nSnowflake\n\nPodcast Episode\n\n\nRedshift\nPresto\nSpark\nParquet\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the open source data integration platform Airbyte and how you can use it to provide self service access to your data consumers.","date_published":"2021-02-22T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b279c024-413a-48a1-beb1-dd09475b3b22.mp3","mime_type":"audio/mpeg","size_in_bytes":37962202,"duration_in_seconds":3135}]},{"id":"podlove-2021-02-14t01:58:36+00:00-049374823d3077b","title":"Building The Foundations For Data Driven Businesses at 5xData","url":"https://www.dataengineeringpodcast.com/5xdata-data-driven-foundations-episode-172","content_text":"Summary\nEvery business aims to be data driven, but not all of them succeed in that effort. In order to be able to truly derive insights from the data that an organization collects, there are certain foundational capabilities that they need to have capacity for. In order to help more businesses build those foundations, Tarush Aggarwal created 5xData, offering collaborative workshops to assist in setting up the technical and organizational systems that are necessary to succeed. In this episode he shares his thoughts on the core elements that are necessary for every business to be data driven, how he is helping companies incorporate those capabilities into their structure, and the ongoing support that he is providing through a network of mastermind groups. This is a great conversation about the initial steps that every group should be thinking of as they start down the road to making data informed decisions.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nRudderStack’s smart customer data pipeline is warehouse-first. 
It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.\nYour host is Tobias Macey and today I’m interviewing Tarush Aggarwal about his mission at 5xData to teach companies how to build solid foundations for their data capabilities\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at 5xData and the story behind it?\nimpact of industry on challenges in becoming data driven\nprofile of companies that you are trying to work with\ncommon mistakes when designing data platform\nmisconceptions that the business has around how to invest in data\nchallenges in attracting/interviewing/hiring data talent\nWhat are the core components that you have standardized on for building the foundational layers of the data platform?\nproviding context and training to business users in order to allow them to self-serve the answers to their questions\n\ntooling/interfaces needed to allow them to ask and investigate questions\n\n\nmost high impact areas for data engineers to focus on in the initial stages of implementing the data platform\nhow to identify and prioritize areas of effort\nuseful structure of data team at different stages of maturity\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building out the business and team of 5xData?\nWhat do you have planned for the future of the business?\nWhat are the industry trends or specific technologies that you are keeping a close watch on?\n\nContact Info\n\nLinkedIn\n@tarush on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\n5xData\nLooker\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nFivetran\n\nPodcast Episode\n\n\nDBT\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Every business aims to be data driven, but not all of them succeed in that effort. In order to be able to truly derive insights from the data that an organization collects, there are certain foundational capabilities that they need to have in place. In order to help more businesses build those foundations, Tarush Aggarwal created 5xData, offering collaborative workshops to assist in setting up the technical and organizational systems that are necessary to succeed. In this episode he shares his thoughts on the core elements that are necessary for every business to be data driven, how he is helping companies incorporate those capabilities into their structure, and the ongoing support that he is providing through a network of mastermind groups. This is a great conversation about the initial steps that every group should be thinking of as they start down the road to making data informed decisions.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Tarush Aggarwal about his work at 5xData to help more companies build the foundations they need to become truly data driven.","date_published":"2021-02-15T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5cff8679-58ab-4874-ac5b-b369d2770397.mp3","mime_type":"audio/mpeg","size_in_bytes":32853234,"duration_in_seconds":3135}]},{"id":"podlove-2021-02-09t01:36:44+00:00-3bc885d557cf40b","title":"How Shopify Is Building Their Production Data Warehouse Using DBT","url":"https://www.dataengineeringpodcast.com/shopify-data-warehouse-with-dbt-episode-171","content_text":"Summary\nWith all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how the structured the project to allow for multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nToday’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. 
Datadog’s machine-learning-based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.\nYour host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what the Shopify platform is?\nWhat kinds of data sources are you working with?\n\nCan you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?\nHow have you structured your data teams to be able to deliver those projects?\n\n\nWhat are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?\nWhat was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?\nWhat were your criteria when selecting a platform for your data warehouse?\n\nWhat decision did those criteria lead you to make?\n\n\nOnce you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?\n\nWhat were your criteria for determining what toolchain to use for managing the data warehouse?\nYou ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?\n\n\nWhat was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?\n\nWhat are some of the shortcomings or edge cases that you ran into?\n\n\nRather than rely on the vanilla DBT workflow, you created a wrapper project to add additional functionality. 
What were some of the features that you needed to add to suit your particular needs?\n\nWhat has been your experience with extending and integrating with DBT to customize it for your environment?\n\n\nCan you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?\n\nHow much of the testing are you able to do with out-of-the-box functionality from DBT?\nWhat are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?\nCan you share how you manage the CI/CD process for changes in your data warehouse?\n\nWhat kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?\n\n\n\n\nHow do you integrate the management of your data warehouse and DBT workflows with your broader data platform?\nNow that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?\n\nAre there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?\n\n\nWhat are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?\nWhat is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)\nAs you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?\nWhat are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?\nWhat are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?\nWhat are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?\n\nContact Info\n\nZeeshan\n\n@zeeshanq on Twitter\nWebsite\n\n\nMichelle\n\n@michellearky on Twitter\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nHow to Build a Production Grade Workflow with SQL Modelling\nShopify\nJRuby\nPySpark\nDruid\nAmplitude\nMode\nSnowflake Schema\nData Vault\n\nPodcast Episode\n\n\nBigQuery\nAmazon Redshift\nCI/CD\nGreat Expectations\n\nPodcast Episode\n\n\nMaster Data Management\n\nPodcast Episode\n\n\nFlink SQL\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

\n

With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow for multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\n\n

\"\"

","summary":"An interview with Shopify's engineers about how they are using DBT to build a data warehouse platform that scales to meet the needs of the business.","date_published":"2021-02-08T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e60aadc4-031f-4566-afe3-c9bed8f9d233.mp3","mime_type":"audio/mpeg","size_in_bytes":33032969,"duration_in_seconds":2790}]},{"id":"podlove-2021-02-02t02:55:13+00:00-4cd5f86ef7f03a2","title":"System Observability For The Cloud Native Era With Chronosphere","url":"https://www.dataengineeringpodcast.com/chronosphere-metrics-observability-episode-170","content_text":"Summary\nCollecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second, the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds, and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. 
Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nToday’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning-based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.\nYour host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable and customizable monitoring-as-a-service purpose built for cloud-native applications.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Chronosphere and your motivation for turning it into a business?\nWhat are the biggest challenges inherent to monitoring use cases?\n\nHow does the advent of cloud native environments complicate things further?\n\n\nWhile you were at Uber you helped to create the M3 storage engine. There is a wide array of time series databases available, including many purpose built for metrics use cases. What were the missing pieces that made it necessary to create a new system?\nHow do you handle schema design/data modeling for metrics storage?\nHow do the usage patterns of metrics systems contribute to the complexity of building a storage layer to support them?\n\nWhat are the optimizations that need to be made for the read and write paths in M3?\n\n\nHow do you handle high cardinality of metrics and ad-hoc queries to understand system behaviors?\nWhat are the scaling factors for M3?\nCan you describe how you have architected the Chronosphere platform?\nWhat are the convenience features built on top of M3 that you are creating at Chronosphere?\nHow do you handle deployment and scaling of your infrastructure given the scale of the businesses that you are working with?\nBeyond just server infrastructure and application behavior, what are some of the other sources of metrics that you and your users are sending into Chronosphere?\n\nHow do those alternative metrics sources complicate the work of generating useful insights from the data?\n\n\nIn addition to the read and write loads, metrics systems also need to be able to identify patterns, thresholds, and anomalies in the data to alert on it with minimal latency. How do you handle that in the Chronosphere platform?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Chronosphere/M3 used?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while building Chronosphere?\nWhen is Chronosphere the wrong choice?\nWhat do you have planned for the future of the platform and business?\n\nContact Info\n\nLinkedIn\n@roskilli on Twitter\nrobskillington on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nChronosphere\nLidar\nCloud Native\nM3DB\nOpenTracing\nMetrics/Telemetry\nGraphite\n\nPodcast.__init__ Episode\n\n\nInfluxDB\nClickhouse\n\nPodcast Episode\n\n\nPrometheus\nInverted Index\nDruid\nCardinality\nApache Flink\n\nPodcast Episode\n\n\nHDFS\nAvro\n\nPodcast Episode\n\n\nGrafana\nTecton\n\nPodcast Episode\n\n\nDatadog\n\nPodcast Episode\n\n\nKubernetes\nSourcegraph\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

\n

Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second; the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds; and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity, the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\n\n

\"\"

","summary":"An interview about the Chronosphere platform and the M3DB storage engine for managing system metrics to power observability in the cloud native era.","date_published":"2021-02-01T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/00b7f9e1-407a-4945-b5dd-a1fbbd974666.mp3","mime_type":"audio/mpeg","size_in_bytes":43877972,"duration_in_seconds":3890}]},{"id":"podlove-2021-01-26t00:11:09+00:00-a4eb4a98cd961a4","title":"Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue","url":"https://www.dataengineeringpodcast.com/hotglue-data-integration-episode-169","content_text":"Summary\nBusinesses often need to be able to ingest data from their customers in order to power the services that they provide. For each new source that they need to integrate with it is another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations David Molot and Hassan Syyid built the Hotlue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self service for end users.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nThis episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. 
Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.\nYour host is Tobias Macey and today I’m interviewing David Molot and Hassan Syyid about Hotglue, an embeddable data integration tool for B2B developers built on the Python ecosystem.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Hotglue?\n\nWhat was your motivation for starting a business to address this particular problem?\n\n\nWho is the target user of Hotglue and what are their biggest data problems?\n\nWhat are the types and sources of data that they are likely to be working with?\nHow are they currently handling solutions for those problems?\nHow does the introduction of Hotglue simplify or improve their work?\n\n\nWhat is involved in getting Hotglue integrated into a given customer’s environment?\nHow is Hotglue itself implemented?\n\nHow has the design or goals of the platform evolved since you first began building it?\nWhat were some of the initial assumptions that you had at the outset and how well have they held up as you progressed?\n\n\nOnce a customer has set up Hotglue what is their workflow for building and executing an ETL workflow?\n\nWhat are their options for working with sources that aren’t supported out of the box?\n\n\nWhat are the biggest design and implementation challenges that you are facing given the need for your product to be embedded in customer platforms and exposed to their end users?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Hotglue used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building Hotglue?\nWhen is Hotglue the wrong choice?\nWhat do you have planned for the future of the product?\n\nContact Info\n\nDavid\n\n@davidmolot on Twitter\nLinkedIn\n\n\nHassan\n\nhsyyid on GitHub\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nHotglue\nPython\n\nThe Python Podcast.__init__\n\n\nB2B == Business to Business\nMeltano\n\nPodcast Episode\n\n\nAirbyte\nSinger\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Businesses often need to be able to ingest data from their customers in order to power the services that they provide. For each new source that they need to integrate with, there is another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations, David Molot and Hassan Syyid built the Hotglue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self service for end users.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with the founders of the Hotglue platform about how they are helping B2B application developers build data integration pipelines for working with customer information.","date_published":"2021-01-25T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8a77a30a-51b9-41c9-a78c-71276e074713.mp3","mime_type":"audio/mpeg","size_in_bytes":25311347,"duration_in_seconds":2045}]},{"id":"podlove-2021-01-19t02:04:12+00:00-ea63a42a56371f7","title":"Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch","url":"https://www.dataengineeringpodcast.com/hightouch-customer-data-warehouse-episode-168","content_text":"Summary\nThe data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nThis episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. 
Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.\nYour host is Tobias Macey and today I’m interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Hightouch and your motivation for creating it?\nWhat are the main points of friction for teams who are trying to make use of customer data?\nWhere is Hightouch positioned in the ecosystem of customer data tools such as Segment, Mixpanel, Amplitude, etc.?\nWho are the target users of Hightouch?\n\nHow has that influenced the design of the platform?\n\n\nWhat are the baseline attributes necessary for Hightouch to populate downstream systems?\n\nWhat are the data modeling considerations that users need to be aware of when sending data to other platforms?\n\n\nCan you describe how Hightouch is architected?\n\nHow has the design of the platform evolved since you first began working on it?\n\n\nWhat goals or assumptions did you have when you first began building Hightouch that have been modified or invalidated once you began working with customers?\nCan you talk through the workflow of using Hightouch to propagate data to other platforms?\n\nHow do you keep data up to date between the warehouse and downstream systems?\n\n\nWhat are the upstream systems that users need to have in place to make Hightouch a viable and effective tool?\nWhat are the benefits of using the data warehouse as the source of truth for downstream services?\nWhat are the trends in data warehousing that you are keeping a close eye on?\n\nWhat are you most excited for?\nAre there any that you find worrisome?\n\n\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen Hightouch used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building Hightouch?\nWhen is Hightouch the wrong choice?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\n@tejasmanohar on Twitter\ntejasmanoharon GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nHightouch\nSegment\n\nPodcast Episode\n\n\nDBT\n\nPodcast Episode\n\n\nLooker\n\nPodcast Episode\n\n\nChange Data Capture\n\nPodcast Episode\n\n\nDatabase Trigger\nMaterialize\n\nPodcast Episode\n\n\nFlink\n\nPodcast Episode\n\n\nZapier\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An episode about the Hightouch platform and how it allows you to maintain a single source of truth for all of your customer data in your data warehouse and keep all of your downstream systems accurate and up to date.","date_published":"2021-01-18T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7cf118b4-9460-463d-9311-7a69372c7246.mp3","mime_type":"audio/mpeg","size_in_bytes":40767708,"duration_in_seconds":3573}]},{"id":"podlove-2021-01-11t23:42:35+00:00-17e2ba56eff4441","title":"Enabling Version Controlled Data Collaboration With TerminusDB","url":"https://www.dataengineeringpodcast.com/terminusdb-data-collaboration-episode-167","content_text":"Summary\nAs data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Sesshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such as system provides, and the complexities inherent to building a successful business on open source.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nDo you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory mangement, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.\nYou invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. 
Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!\nYour host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what TerminusDB is and what motivated you to build it?\nWhat are the use cases that TerminusDB and TerminusHub are designed for?\nThere are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?\nCan you describe how TerminusDB is implemented?\n\nHow has the design changed or evolved since you first began working on it?\nWhat was the decision process and design considerations that led you to choose Prolog as the implementation language?\n\n\nOne of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?\nWhat are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)\nHow does the use of RDF triples and JSON-LD impact the audience for TerminusDB?\nHow much overhead is incurred by maintaining a long history of changes for a database?\n\nHow do you handle garbage collection/compaction of versions?\n\n\nHow does the availability of branching and merging strategies change the approach that data teams take when working on a project?\nWhat are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?\nWhat are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?\nAnother interesting element of the TerminusDB platform is the query language. 
What did you use as inspiration for designing it and how much of a learning curve is involved?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?\nWhen is TerminusDB the wrong choice?\nWhat do you have planned for the future of the project?\n\nContact Info\n\n@GavinMGleason on Twitter\nLinkedIn\nGavinMendelGleason on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTerminusDB\nTerminusHub\nChem Informatics\nType Theory\nGraph Database\nTrinity College Dublin\nSesshat Databank analytics over civilizations in history\nPostgreSQL\nDGraph\nGrakn\nNeo4J\nDatomic\nLakeFS\nDVC\nDolt\nPersistent Succinct Data Structure\nCurrying\nProlog\nWOQL TerminusDB query language\nRDF\nJSON-LD\nSemantic Web\nProperty Graph\nHypergraph\nSuper Node\nBloom Filters\nData Curation\n\nPodcast Episode\n\n\nCRDT == Conflict-Free Replicated Data Types\n\nPodcast Episode\n\n\nSPARQL\nDatalog\nAST == Abstract Syntax Tree\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Sesshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview about the TerminusDB platform and how it supports data collaboration through a version controlled graph storage engine.","date_published":"2021-01-11T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a72510b5-b6a0-4c55-87e1-fc20cb724d96.mp3","mime_type":"audio/mpeg","size_in_bytes":38185192,"duration_in_seconds":3468}]},{"id":"podlove-2021-01-03t21:12:09+00:00-a63e8dff65b5735","title":"Bringing Feature Stores and MLOps to the Enterprise at Tecton","url":"https://www.dataengineeringpodcast.com/tecton-mlops-feature-store-episode-166","content_text":"Summary\nAs more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self service manner. As a result the feature store is becoming a required piece of the data platform. To fill that need Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelanagelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for well engineered feature store.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nDo you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory mangement, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.\nYou invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. 
The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!\nYour host is Tobias Macey and today I’m interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Tecton and your motivation for starting the business?\nFor anyone who isn’t familiar with the concept, what is an example of a feature?\nHow do you define what a feature store is?\nWhat role does a feature store play in the overall lifecycle of a machine learning project?\nHow would you characterize the current landscape of feature stores?\nWhat are the other components that are necessary for a complete ML operations platform?\nAt what points in the lifecycle of data does the feature store get integrated?\nWhat types of data can feature stores manage? (e.g. text vs. image/binary vs. spatial, etc.)\nHow is the Tecton platform implemented?\n\nHow has the design evolved since you first began building it?\n\nHow did your work on Uber’s Michelangelo inform your work on Tecton?\n\n\n\n\nWhat is the workflow and lifecycle of developing, testing, and deploying a feature to a feature store?\nWhat aspects of a feature do you monitor to determine whether it has drifted?\n\nHow do you define drift in the context of a feature?\n\nHow does that differ from drift in an ML model?\n\n\n\n\nHow does Tecton handle versioning of features and associating those different versions with the models that are using them?\nWhat are some of the most interesting, innovative, or unexpected projects that you have seen built with Tecton?\nWhen is Tecton the wrong choice?\nWhat do you have planned for the future of the product?\n\nContact Info\n\nLinkedIn\nkevinstumpf on GitHub\n@kevinstumpf on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nTecton\nUber Michelangelo\nMLOps\nFeature Store\nBlog: What Is A Feature Store\nStreamSQL\n\nPodcast Episode\n\n\nAWS Feature Store\nLogical Clocks\nEMR\nKotlin\nDynamoDB\nscikit-learn\nTensorflow\nMLFlow\nAlgorithmia\nSageMaker\nFeast open source feature store\nJaeger\nOpenTelemetry\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self service manner. As a result, the feature store is becoming a required piece of the data platform. To fill that need, Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well engineered feature store.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

\"\"

","summary":"An interview with Kevin Stumpf, CTO of Tecton, about his work building an enterprise grade feature store and how it functions as the core element of an MLOps strategy.","date_published":"2021-01-04T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/db2ad644-7832-4ad7-9a7a-80f596786e49.mp3","mime_type":"audio/mpeg","size_in_bytes":33017889,"duration_in_seconds":2860}]},{"id":"podlove-2020-12-28t14:53:43+00:00-0325b738b91696b","title":"Off The Shelf Data Governance With Satori","url":"https://www.dataengineeringpodcast.com/satori-cloud-data-governance-episode-165","content_text":"Summary\nOne of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they learned in that field to address the challenge of access control and auditing for data governance. In this episode co-founder and CTO Yoav Cohen explains how the Satori platform provides a proxy layer for your data, the challenges of managing security across disparate storage systems, and their approach to building a dynamic data catalog based on the records that your organization is actually using. This is an interesting conversation about the intersection of data and security and the lessons that can be learned in each direction.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. 
And don’t forget to thank them for their continued support of this show!\nYour host is Tobias Macey and today I’m interviewing Yoav Cohen about Satori, a data access service to monitor, classify and control access to sensitive data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you have built at Satori?\n\nWhat is the story behind the product and company?\n\n\nHow does Satori compare to other tools and products for managing access control and governance for data assets?\nWhat are the biggest challenges that organizations face in establishing and enforcing policies for their data?\nWhat are the main goals for the Satori product and what use cases does it enable?\nCan you describe how the Satori platform is architected?\n\nHow has the design of the platform evolved since you first began working on it?\n\n\nHow have your experiences working in cyber security informed your approach to data governance?\nHow does the design of the Satori platform simplify technical aspects of data governance?\n\nWhat aspects of governance do you delegate to other systems or platforms?\n\n\nWhat elements of data infrastructure does Satori integrate with?\n\nFor someone who is adopting Satori, what is involved in getting it deployed and set up with their existing data platforms?\n\n\nWhat do you see as being the most complex or underserved aspects of data governance?\n\nHow much of that complexity is inherent to the problem vs. being a result of how the industry has evolved?\n\n\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen the Satori platform used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building Satori?\nWhen is Satori the wrong choice?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\n@yoavcohen on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSatori\nData Governance\nData Masking\nTLS == Transport Layer Security\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the Satori data governance and access control platform is architected and how it helps you manage compliance with data privacy regulations.","date_published":"2020-12-28T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9870673e-bdd2-4a36-aee2-1343dc82b84d.mp3","mime_type":"audio/mpeg","size_in_bytes":25570308,"duration_in_seconds":2064}]},{"id":"podlove-2020-12-21t22:40:32+00:00-49a97fdfb9bb940","title":"Low Friction Data Governance With Immuta","url":"https://www.dataengineeringpodcast.com/immuta-data-governance-episode-164","content_text":"Summary\nData governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nFeature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code, and 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset or your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size you can get your first feature flags added in minutes without breaking the bank. 
Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan.\nYou invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!\nYour host is Tobias Macey and today I’m interviewing Steve Touw and Stephen Bailey about Immuta and how they work to automate data governance\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you have built at Immuta and your motivation for starting the company?\nWhat is data governance?\n\nHow much of data governance can be solved with technology and how much is a matter of process and communication?\n\n\nWhat does the current landscape of data governance solutions look like?\n\nWhat are the motivating factors that would lead someone to choose Immuta as a component of their data governance strategy?\n\n\nHow does Immuta integrate with the broader ecosystem of data tools and platforms?\n\nWhat other workflows or activities are necessary outside of Immuta to ensure a comprehensive governance/compliance strategy?\n\n\nWhat are some of the common blind spots when it comes to data governance?\nHow is the Immuta platform architected?\n\nHow have the design and goals of the system evolved since you first started building it?\n\n\nWhat is involved in adopting Immuta for an existing data platform?\n\nOnce an organization has integrated Immuta, what are the workflows for the different stakeholders of the data?\n\n\nWhat are the biggest challenges in automated discovery/identification of sensitive data?\n\nHow does the evolution of what qualifies as sensitive complicate those efforts?\n\n\nHow do you approach the challenge of providing a unified interface for access control and auditing across different systems (e.g. 
BigQuery, Snowflake, RedShift, etc.)?\nWhat are the complexities that creep into data masking?\n\nWhat are some alternatives for obfuscating and managing access to sensitive information?\n\n\nHow do you handle managing access control/masking/tagging for derived data sets?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while building Immuta?\nWhen is Immuta the wrong choice?\nWhat do you have planned for the future of the platform and business?\n\nContact Info\n\nSteve\n\nLinkedIn\n@steve_touw on Twitter\n\n\nStephen\n\nLinkedIn\nWebsite\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nImmuta\nData Governance\nData Catalog\nSnowflake DB\n\nPodcast Episode\n\n\nLooker\n\nPodcast Episode\n\n\nCollibra\nABAC == Attribute Based Access Control\nRBAC == Role Based Access Control\nPaul Ohm: Broken Promises of Privacy\nPET == Privacy Enhancing Technologies\nK Anonymization\nDifferential Privacy\nLDAP == Lightweight Directory Access Protocol\nActive Directory\nCOVID Alliance\nHIPAA\nGDPR\nCCPA\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the Immuta platform simplifies the work of managing access control and data security as part of your data governance strategy.","date_published":"2020-12-21T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/65b680b4-1baf-4b64-a0fc-e896d7b9c961.mp3","mime_type":"audio/mpeg","size_in_bytes":36909404,"duration_in_seconds":3213}]},{"id":"podlove-2020-12-15t01:45:38+00:00-8f18a07750dd17a","title":"Building A Self Service Data Platform For Alternative Data Analytics At YipitData","url":"https://www.dataengineeringpodcast.com/yipitdata-alternative-data-analytics-episode-163","content_text":"\n\nSummary\nAs a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData they rely on a variety of alternative data sources to inform investment decisions by hedge funds and businesses. In this episode Andrew Gross, Bobby Muldoon, and Anup Segu describe the self service data platform that they have built to allow data analysts to own the end-to-end delivery of data projects and how that has allowed them to scale their output. They share the journey that they went through to build a scalable and maintainable system for web scraping, how to make it reliable and resilient to errors, and the lessons that they learned in the process. This was a great conversation about real world experiences in building a successful data-oriented business.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. 
Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYour host is Tobias Macey and today I’m interviewing Andrew Gross, Bobby Muldoon, and Anup Segu about they are building pipelines at Yipit Data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what YipitData does?\nWhat kinds of data sources and data assets are you working with?\nWhat is the composition of your data teams and how are they structured?\nGiven the use of your data products in the financial sector how do you handle monitoring and alerting around data quality?\n\nFor web scraping in particular, given how fragile it can be, what have you done to make it a reliable and repeatable part of the data pipeline?\n\n\nCan you describe how your data platform is implemented?\n\nHow has the design of your platform and its goals evolved or changed?\n\n\nWhat is your guiding principle for providing an approachable interface to analysts?\n\nHow much knowledge do your analysts require about the guarantees offered, and edge cases to be aware of in the underlying data and its processing?\n\n\nWhat are some examples of specific tools that you have built to empower your analysts to own the full lifecycle of the data that they are working with?\nCan you characterize or quantify the benefits that you have seen from training the analysts to work with the engineering tool chain?\nWhat have been some of the most interesting, unexpected, or surprising outcomes of how you are approaching the different responsibilities and levels of ownership in your data organization?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned from building out the platform, tooling, and organizational structure for creating data products at Yipit?\nWhat advice or recommendations do you have for other leaders of data teams about how to think about the organizational and technical aspects of managing the lifecycle of data projects?\n\nContact Info\n\nAndrew\n\nLinkedIn\n@awgross on Twitter\n\n\nBobby\n\nLinkedIn\n@TheDooner64\n\n\nAnup\n\nLinkedIn\nanup-segu on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nYipit Data\nRedshift\nMySQL\nAirflow\nDatabricks\nGroupon\nLiving Social\nWeb Scraping\n\nPodcast.__init__ Episode\n\n\nReadypipe\nGraphite\n\nPodcast.init Episode\n\n\nAWS Kinesis Firehose\nParquet\nPapermill\n\nPodcast Episode About Notebooks At Netflix\n\n\nFivetran\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
","summary":"An interview with the YipitData team about how they built a self service platform for building analytics products on alternative data sets to power investment strategies.","date_published":"2020-12-14T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/385dca54-aa71-459e-9e85-ea6e0af05886.mp3","mime_type":"audio/mpeg","size_in_bytes":48582698,"duration_in_seconds":3887}]},{"id":"podlove-2020-12-08t15:36:54+00:00-46643837c5b1265","title":"Proven Patterns For Building Successful Data Teams","url":"https://www.dataengineeringpodcast.com/data-teams-book-episode-162","content_text":"Summary\nBuilding data products are complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Different organizations have tried a multitude of organizational strategies to improve the success rate of these data teams with varying levels of success. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products then this is a conversation that you don’t want to miss.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. 
Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYour host is Tobias Macey and today I’m interviewing Jesse Anderson about best practices for organizing and managing data teams\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of how you view the mission and responsibilities of a data team?\n\nWhat are the critical elements of a successful data team?\nBeyond the core pillars of data science, data engineering, and operations, what other specialized roles do you find helpful for larger or more sophisticated teams?\n\n\nFor organizations that have \"small data\", how does that change the necessary composition of roles for successful data projects?\nWhat are the signs and symptoms that point to the need for a dedicated team that focuses on data?\nWith data scientists and data engineers in particular being in such high demand, what are strategies that you have found effective for attracting new talent?\n\nIn the case where you have engineers on staff, how do you identify internal talent that can be trained into these specialized roles?\n\n\nAnother challenge that organizations face in dealing with data is how the team is organized. What are your thoughts on effective strategies for how to structure the communication and reporting structures of data teams? (e.g. centralized, embedded, etc.)\nHow do you recommend evaluating potential candidates for each of the necessary roles?\n\nWhat are your thoughts on when to hire an outside consultant, vs building internal capacity?\n\n\nFor managers who are responsible for data teams, how much understanding of data and analytics do they need to be effective?\n\nHow do you define success or measure performance of a team focused on working with data?\n\n\nWhat are some of the anti-patterns that you have seen in managers who oversee data professionals?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of helping organizations and individuals achieve success in data and analytics?\nWhat advice or additional resources do you have for anyone who is interested in learning more about how to build and grow a successful data team?\n\nContact Info\n\nWebsite\n@jessetanderson on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nData Teams Book\nDBA == Database Administrator\nML Engineer\nDataOps\nThree Vs\nThe Ultimate Guide To Switching Careers To Big Data\nS-1 Report\nJesse Anderson’s Youtube Channel\n\nVideo about interviewing for data teams\n\n\nUber Data Infrastructure Progression Blog Post\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Jesse Anderson about the lessons that he has learned from helping organizations large and small built high functioning data teams that are able to turn big data into valuable products.","date_published":"2020-12-07T18:45:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/de460c8c-4a4a-41ce-8dda-a200fbe058e4.mp3","mime_type":"audio/mpeg","size_in_bytes":56142052,"duration_in_seconds":4350}]},{"id":"podlove-2020-11-30t23:32:16+00:00-a7fd58d8656b88c","title":"Streaming Data Integration Without The Code at Equalum","url":"https://www.dataengineeringpodcast.com/equalum-streaming-data-integration-episode-161","content_text":"Summary\nThe first stage of every good pipeline is to perform data integration. With the increasing pace of change and the need for up to date analytics the need to integrate that data in near real time is growing. With the improvements and increased variety of options for streaming data engines and improved tools for change data capture it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. 
Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nYour host is Tobias Macey and today I’m interviewing Ido Friedman about Equalum, a no-code platform for streaming data integration\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Equalum and how it got started?\nThere are a number of projects and platforms on the market that target data integration. Can you give some context of how Equalum fits in that market and the differentiating factors that engineers should consider?\nWhat components of the data ecosystem might Equalum replace, and which are you designed to integrate with?\nCan you walk through the workflow for someone who is using Equalum for a simple data integration use case?\n\nWhat options are available for doing in-flight transformations of data or creating customized routing rules?\nHow do you handle versioning and staged rollouts of changes to pipelines?\n\n\nHow is the Equalum platform implemented?\n\nHow has the design and architecture of Equalum evolved since it was first created?\nWhat have you found to be the most complex or challenging aspects of building the platform?\n\n\nChange data capture is a growing area of interest, with a significant level of difficulty in implementing well. 
How do you handle support for the variety of different sources that customers are working with?\n\nWhat are the edge cases that you typically run into when working with changes in databases?\n\n\nHow do you approach the user experience of the platform given its focus as a low code/no code system?\n\nWhat options exist for sophisticated users to create custom operations?\n\n\nHow much of the underlying concerns do you surface to end users, and how much are you able to hide?\nWhat is the process for a customer to integrate Equalum into their existing infrastructure and data systems?\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen Equalum used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Equalum platform?\nWhen is Equalum the wrong choice?\nWhat do you have planned for the future of Equalum?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nEqualum\nChange Data Capture\n\nDebezium Podcast Episode\n\n\nSQL Server\nDBA == Database Administrator\nFivetran\n\nPodcast Episode\n\n\nSinger\nPentaho\nEMR\nSnowflake\n\nPodcast Episode\n\n\nS3\nKafka\nSpark\nPrometheus\nGrafana\nLogminer\nOBLP == Oracle Binary Log Parser\nAnsible\nTerraform\nJupyter Notebooks\nPapermill\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the Equalum platform is architected to provide streaming data integration workflows with a no-code interface.","date_published":"2020-11-30T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c68f2e29-7ad4-4b5c-b2b7-0a629f7f20bc.mp3","mime_type":"audio/mpeg","size_in_bytes":34233477,"duration_in_seconds":2690}]},{"id":"podlove-2020-11-23t22:55:12+00:00-56fcd88661f6902","title":"Keeping A Bigeye On The Data Quality Market","url":"https://www.dataengineeringpodcast.com/bigeye-data-quality-market-episode-160","content_text":"Summary\nOne of the oldest aphorisms about data is \"garbage in, garbage out\", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid then this episode is definitely worth a listen.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. 
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nYour host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your views on what attributes you consider when defining data quality?\nYou use the term \"data semantics\" – can you elaborate on what that means?\nWhat are the driving factors that contribute to the presence or lack of data quality in an organization or data platform?\nWhy do you think now is the right time to focus on data quality as an industry?\nWhat are you building at Bigeye and how did it get started?\nHow does Bigeye help teams understand and manage their data quality?\nWhat is the difference between existing data quality approaches and data observability?\n\nWhat do you see as the tradeoffs for the approach that you are taking at Bigeye?\n\n\nWhat are the most common data quality issues that you’ve seen and what are some more interesting ones that you wouldn’t expect?\nWhere do you see Bigeye fitting into the data management landscape? What are alternatives to Bigeye?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Bigeye being used?\n\nWhat are some of the most interesting homegrown approaches that you have seen?\n\n\nWhat have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building the Bigeye platform and business?\nWhat are the biggest trends you’re following in data quality management?\nWhen is Bigeye the wrong choice?\nWhat do you see in store for the future of Bigeye?\n\nContact Info\n\nYou can email Egor about anything data\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nBigeye\nUber\nA/B Testing\nHadoop\nMapReduce\nApache Impala\nOne King’s Lane\nVertica\nMode\nTableau\nJupyter Notebooks\nRedshift\nSnowflake\nPyTorch\n\nPodcast.__init__ Episode\n\n\nTensorflow\nDataOps\nDevOps\nData Catalog\nDBT\n\nPodcast Episode\n\n\nSRE Handbook\nArticle About How Uber Applied SRE Principles to Data\nSLA == Service Level Agreement\nSLO == Service Level Objective\nDagster\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nDelta Lake\nGreat Expectations\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nAmundsen\n\nPodcast Episode\n\n\nAlation\nCollibra\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Egor Gryaznov about the market for data quality management and the challenge of keeping your data pipelines healthy.","date_published":"2020-11-23T18:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/03c79306-de5e-4bd5-9365-b0ba801d92fc.mp3","mime_type":"audio/mpeg","size_in_bytes":40237898,"duration_in_seconds":2965}]},{"id":"podlove-2020-11-17t01:11:25+00:00-390317eb0cba8bb","title":"Self Service Data Management From Ingest To Insights With Isima","url":"https://www.dataengineeringpodcast.com/isima-data-management-platform-episode-159","content_text":"Summary\nThe core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or APIs on top of a cleaned and curated data set. Despite the rapid progression of impressive tools and products built to fulfill this mission, it is still an uphill battle to tie everything together into a cohesive and reliable platform. At Isima they decided to reimagine the entire ecosystem from the ground up and built a single unified platform to allow end-to-end self service workflows from data ingestion through to analysis. In this episode CEO and co-founder of Isima Darshan Rawal explains how the biOS platform is architected to enable ease of use, the challenges that were involved in building an entirely new system from scratch, and how it can integrate with the rest of your data platform to allow for incremental adoption. This was an interesting and contrarian take on the current state of the data management industry and is worth a listen to gain some additional perspective.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. 
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nYour host is Tobias Macey and today I’m interviewing Darshan Rawal about Îsíma, a unified platform for building data applications\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you are building at Îsíma?\n\nWhat was your motivation for creating a new platform for data applications?\nWhat is the story behind the name?\n\n\nWhat are the tradeoffs of a fully integrated platform vs a modular approach?\nWhat components of the data ecosystem does Isima replace, and which does it integrate with?\nWhat are the use cases that Isima enables which were previously impractical?\nCan you describe how Isima is architected?\n\nHow has the design of the platform changed or evolved since you first began working on it?\nWhat were your initial ideas or assumptions that have been changed or invalidated as you worked through the problem you’re addressing?\n\n\nWith a focus on the enterprise, how did you approach the user experience design to allow for organizational complexity?\n\nOne of the biggest areas of difficulty that many data systems face is security and scaleable access control. How do you tackle that problem in your platform?\n\n\nHow did you address the issue of geographical distribution of data and users?\nCan you talk through the overall lifecycle of data as it traverses the bi(OS) platform from ingestion through to presentation?\nWhat is the workflow for someone using bi(OS)?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen bi(OS) used?\nWhat have you found to be the most interesting, unexpected, or challenging aspects of building the bi(OS) platform?\nWhen is it the wrong choice?\nWhat do you have planned for the future of Isima and bi(OS)?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nÎsíma\nDatastax\nVerizon\nAT&T\nClick Fraud\nESB == Enterprise Service Bus\nETL == Extract, Transform, Load\nEDW == Enterprise Data Warehouse\nBI == Business Intelligence\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview with Isima's CEO about their integrated data management platform that simplifies and accelerates delivery of data products.","date_published":"2020-11-16T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8ef19c87-2249-4b38-bb8f-78f191e421c8.mp3","mime_type":"audio/mpeg","size_in_bytes":35687151,"duration_in_seconds":2642}]},{"id":"podlove-2020-11-10t02:26:48+00:00-900a98909a606c4","title":"Building A Cost Effective Data Catalog With Tree Schema","url":"https://www.dataengineeringpodcast.com/tree-schema-data-catalog-episode-158","content_text":"Summary\nA data catalog is a critical piece of infrastructure for any organization who wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nModern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. 
Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nYour host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human friendly data catalog\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what you have built at Tree Schema?\n\nWhat was your motivation for creating it?\n\n\nAt what stage of maturity should a team or organization consider a data catalog to be a necessary component in their data platform?\nThere are a large and growing number of projects and products designed to provide a data catalog, with each of them addressing the problem in a slightly different way. What are the necessary elements for a data catalog?\n\nHow does Tree Schema compare to the available options? (e.g. Amundsen, Company Wiki, Metacat, Metamapper, etc.)\n\n\nHow is the Tree Schema system implemented?\n\nHow has the design or direction of Tree Schema evolved since you first began working on it?\n\n\nHow did you approach the schema definitions for defining entities?\nWhat was your guiding heuristic for determining how to design the interface and data models? – I wrote down notes that combine this with the question above\nHow do you handle integrating with data sources?\nIn addition to storing schema information you allow users to store information about the transformations being performed. How is that represented?\n\nHow can users populate information about their transformations in an automated fashion?\n\n\nHow do you approach evolution and versioning of schema information?\nWhat are the scaling limitations of tree schema, whether in terms of the technical or cognitive complexity that it can handle?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Tree Schema being used?\nWhat have you found to be the most interesting, unexpected, or challenging lessons learned in the process of building and promoting Tree Schema?\nWhen is Tree Schema the wrong choice?\nWhat do you have planned for the future of the product?\n\nContact Info\n\nEmail\nLinkedin\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nTree Schema\nTree Schema – Data Lineage as Code\nCapital One\nWalmart Labs\nData Catalog\nData Discovery\nAmundsen\nMetacat\nMarquez\nMetamapper\nInfoworks\nCollibra\nFaust\n\nPodcast.__init__ Episode\n\n\nDjango\nPostgreSQL\nRedis\nCelery\nAmazon ECS (Elastic Container Service)\nDjango Storages\nDagster\nAirflow\nDataHub\nAvro\nSinger\nApache Atlas\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy-to-use and cost-effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Tree Schema data catalog platform and using it to quickly get visibility into your data assets.","date_published":"2020-11-09T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5ca3eeab-18bf-4adb-a28e-a1d659fd577d.mp3","mime_type":"audio/mpeg","size_in_bytes":42562708,"duration_in_seconds":3112}]},{"id":"podlove-2020-11-02t23:51:43+00:00-0aa9f1f2845e13a","title":"Add Version Control To Your Data Lake With LakeFS","url":"https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157","content_text":"Summary\nData lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. 
If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYour host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what LakeFS is and why you built it?\n\nThere are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.)\n\n\nWhat are the primary use cases that LakeFS enables?\nFor someone who wants to use LakeFS what is involved in getting it set up?\nHow is LakeFS implemented?\n\nHow has the design of the system changed or evolved since you began working on it?\nWhat assumptions did you have going into it which have since been invalidated or modified?\n\n\nHow does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface?\nHow do you handle merge conflicts and resolution?\n\nWhat are some of the potential edge cases or foot guns that they should be aware of when there are multiple people using the same repository?\n\n\nHow do you approach management of the data lifecycle or garbage collection to avoid ballooning the cost of storage for a dataset that is tracking a high number of branches with diverging commits?\nGiven that S3 and GCS are eventually consistent storage layers, how do you handle snapshots/transactionality of the data you are working with?\nWhat are the axes for scaling an installation of LakeFS?\n\nWhat are the limitations in terms of size or geographic distribution of the datasets?\n\n\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen LakeFS being used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building LakeFS?\nWhen is LakeFS the wrong choice?\nWhat do you have planned for the future of the project?\n\nContact Info\n\nEinat Orr\nOz Katz\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nTreeverse\nLakeFS\n\nGitHub\nDocumentation\n\n\nlakeFS Slack Channel\nSimilarWeb\nKaggle\nDagsHub\nAlluxio\nPachyderm\nDVC\nML Ops (Machine Learning Operations)\nDoltHub\nDelta Lake\n\nPodcast Episode\n\n\nHudi\nIceberg Table Format\n\nPodcast Episode\n\n\nKubernetes\nPostgreSQL\n\nPodcast Episode\n\n\nGit\nSpark\nPresto\nCockroachDB\nYugabyteDB\nCitus\nHive Metastore\nIceberg Table Format\nImmunai\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits, there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges, the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the creators of LakeFS about how it adds branching and merging capabilities to your data lake for safer and easier integration of new sources and pipelines.","date_published":"2020-11-02T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/42bcea66-68c4-4f0a-9fc0-d062f2db744e.mp3","mime_type":"audio/mpeg","size_in_bytes":41012847,"duration_in_seconds":3015}]},{"id":"podlove-2020-10-26t22:21:38+00:00-cd4732cb3b397f5","title":"Cloud Native Data Security As Code With Cyral","url":"https://www.dataengineeringpodcast.com/cyral-data-security-episode-156","content_text":"Summary\nOne of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the Cloud Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. 
Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Manav Mital about the challenges involved in securing your data and the work that he is doing at Cyral to help address those problems.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Cyral and what motivated you to build a business focused on addressing data security in the cloud?\nCan you start by giving an overview of some of the common security issues that occur when working with data?\n\nWhat new security challenges are introduced by building data platforms in public cloud environments?\n\n\nWhat are the organizational roles that are typically responsible for managing security and access control to data sources and repositories?\n\nWhat are the tensions, technical or organizational, that lead to a problematic or incomplete security posture?\n\n\nWhat are the differences in security requirements and implementation complexity between software applications and data systems?\nWhat are the data systems that Cyral integrates with?\n\nHow did you determine what platforms to prioritize?\n\n\nHow does Cyral integrate into the toolchains used to deploy, maintain, and upgrade an organization’s data infrastructure?\nHow does the Cyral platform address security and access control of data across an organization’s infrastructure?\nHow are schema changes handled when using Cyral to enforce access control to PII or other attributes?\nHow does Cyral help with reducing sprawl of data across unmonitored systems?\nWhat are some of the most interesting, unexpected, or challenging lessons that you learned while building Cyral?\nWhen is Cyral the wrong choice?\nWhat do you have planned for the future of the Cyral platform?\n\nContact Info\n\nLinkedIn\n@manavrm on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCyral\nSnowflake\n\nPodcast Episode\n\n\nBigQuery\nObject Storage\nMongoDB\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Cyral platform and how it enforces data security as code for protecting databases and object storage in the cloud.","date_published":"2020-10-26T18:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3dc6af17-e670-4622-bef1-a53b8668691c.mp3","mime_type":"audio/mpeg","size_in_bytes":35454986,"duration_in_seconds":2912}]},{"id":"podlove-2020-10-19t23:01:26+00:00-006ad0fce51b775","title":"Better Data Quality Through Observability With Monte Carlo","url":"https://www.dataengineeringpodcast.com/monte-carlo-observability-data-quality-episode-155","content_text":"Summary\nIn order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. 
Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nHow did you come up with the idea to found Monte Carlo?\nWhat is \"data downtime\"?\nCan you start by giving your definition of observability in the context of data workflows?\nWhat are some of the contributing factors that lead to poor data quality at the different stages of the lifecycle?\nMonitoring and observability of infrastructure and software applications is a well understood problem. In what ways does observability of data applications differ from \"traditional\" software systems?\nWhat are some of the metrics or signals that we should be looking at to identify problems in our data applications?\nWhy is this the year that so many companies are working to address the issue of data quality and observability?\nHow are you addressing the challenge of bringing observability to data platforms at Monte Carlo?\nWhat are the areas of integration that you are targeting and how did you identify where to prioritize your efforts?\nFor someone who is using Monte Carlo, how does the platform help them to identify and resolve issues in their data?\nWhat stage of the data lifecycle have you found to be the biggest contributor to downtime and quality issues?\nWhat are the most challenging systems, platforms, or tool chains to gain visibility into?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen teams address their observability needs?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building the business and technology of Monte Carlo?\nWhat are the alternatives to Monte Carlo?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nVisit www.montecarlodata.com?utm_source=rss&utm_medium=rss to lean more about our data reliability platform;\nOr reach out directly to barr@montecarlodata.com — happy to chat about all things data!\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nMonte Carlo\nMonte Carlo Platform\nObservability\nGainsight\nBarracuda Networks\nDevOps\nNew Relic\nDatadog\nNetflix RAD Outlier Detection\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the founders of Monte Carlo about how observability of your data platform contributes to higher data quality and reduces data downtime.","date_published":"2020-10-19T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9b23b795-426a-431e-b1ae-4291c42da33c.mp3","mime_type":"audio/mpeg","size_in_bytes":43267537,"duration_in_seconds":3352}]},{"id":"podlove-2020-10-12t23:26:40+00:00-66ccaa7634b74f5","title":"Rapid Delivery Of Business Intelligence Using Power BI","url":"https://www.dataengineeringpodcast.com/power-bi-business-intelligence-episode-154","content_text":"Summary\nBusiness intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options. He explains how he helped to build the platform during his time at Microsoft, and how he continues to support users through his work at Power Pivot Pro. Rob shares some useful insights gained through his consulting work, and why he considers Power BI to be the best option on the market today for business analytics.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. 
Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nEqualum’s end to end data ingestion platform is relied upon by enterprises across industries to seamlessly stream data to operational, real-time analytics and machine learning environments. Equalum combines streaming Change Data Capture, replication, complex transformations, batch processing and full data management using a no-code UI. Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka and others under the hood. Tool consolidation and linear scalability without the legacy platform price tag. Go to dataengineeringpodcast.com/equalum today to start a free 2 week test run of their platform, and don’t forget to tell them that we sent you.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Rob Collie about Microsoft’s Power BI platform and his work at Power Pivot Pro to help users employ it effectively.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what Power BI is?\nThe business intelligence market is fairly crowded. What are the features of Power BI that make it stand out?\nWho are the target users of Power BI?\n\nHow does the design of the platform reflect those priorities?\n\n\nCan you talk through the workflow for someone to build a report or dashboard in Power BI?\nWhat is the broader ecosystem of data tools and platforms that Power BI sits within?\n\nWhat are the available integration and extension points for Power BI?\n\n\nIn addition to your work at Microsoft building Power BI you now run a consulting company dedicated to helping people adopt that platform. What are some of the common challenges that users face in employing Power BI effectively?\nIn your experience working with clients, what are some of the core principles of data processing and visualization that apply across industries?\n\nWhat are some of the modeling or presentation methods that are specific to a given industry?\n\n\nOne of the perennial challenges of business intelligence is to make reports discoverable. 
What facilities does Power BI have to aid in surfacing useful information to end users?\nWhat capabilities does Power BI have for exposing elements of data quality?\nWhat are some of the most challenging aspects of building and maintaining a business intelligence effort in an organization?\nWhat are some of the most interesting, unexpected, or innovative uses of Power BI that you have seen, or projects that you have worked on?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned in your work building Power BI and building a business to support its users?\nWhen is Power BI the wrong choice?\nWhat trends in business intelligence are you most excited by?\n\nContact Info\n\nLinkedIn\n@robocolli3 on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nP3\nPower BI\nMicrosoft Excel\nFantasy Football\nExcel Functions\nLisp\nBusiness Intelligence\nVLOOKUP\nLooker\n\nPodcast Episode\n\n\nSQL Server Reporting Services\nSQL Server Analysis Services\nTableau\nMaster Data Management\nERP == Enterprise Resoure Planning\nM Language\nPower Query\nDAX\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options. He explains how he helped to build the platform during his time at Microsoft, and how he continues to support users through his work at Power Pivot Pro. Rob shares some useful insights gained through his consulting work, and why he considers Power BI to be the best option on the market today for business analytics.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Rob Collie about his work building and supporting the Power BI platform at Microsoft and Power Pivot Pro, and why he considers it to be the best option for business intelligence today.","date_published":"2020-10-12T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ad229e52-c675-4078-b91f-1ca643771b95.mp3","mime_type":"audio/mpeg","size_in_bytes":51865297,"duration_in_seconds":3774}]},{"id":"podlove-2020-10-05t23:24:47+00:00-6f26b9bd2c8a19a","title":"Self Service Real Time Data Integration Without The Headaches With Meroxa","url":"https://www.dataengineeringpodcast.com/meroxa-data-integration-episode-153","content_text":"Summary\nAnalytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. 
Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data integration\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Meroxa and what motivated you to turn it into a business?\nWhat are the lessons that you learned from your time at Heroku which you are applying to your work on Meroxa?\nWho are your target users and what are your guiding principles for designing the platform interface?\nWhat are the common difficulties that engineers face in building and maintaining data infrastructure?\nThere are a variety of platforms that offer solutions for managing data integration, or powering end-to-end analytics, or building machine learning pipelines. What are the shortcomings of those existing options that might lead someone to choose Meroxa?\nHow is the Meroxa platform architected?\n\nWhat are some of the initial assumptions that you had which have been challenged as you proceed with implementation?\n\n\nWhat new capabilities does Meroxa bring to someone who uses it for integrating their application data?\nWhat are the growth options for organizations that get started with Meroxa?\nWhat are the core principles that you are focused on to allow for evolving your platform over the long run as the surrounding ecosystem continues to mature?\nWhen is Meroxa the wrong choice?\nWhat do you have planned for the future?\n\nContact Info\n\nDeVaris Brown\nAli Hamidi\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMeroxa\nHeroku\nHeroku Kafka\nAscend\nStreamSets\nNexus\nKafka Connect\nAirflow\n\nPodcast.__init__ Episode\n\n\nSpark\n\nData Engineering Episode\n\n\nChange Data Capture\nSegment\n\nPodcast Episode\n\n\nRudderstack\nMParticle\nDebezium\n\nPodcast Episode\n\n\nDBT\n\nPodcast Episode\n\n\nMaterialize\n\nPodcast Episode\n\n\nStitch Data\nFivetran\n\nPodcast Episode\n\n\nElasticsearch\n\nPodcast Episode\n\n\ngRPC\nGraphQL\nREST == REpresentational State Transfer\nDagster/Elementl\n\nData Engineering Podcast Episode\nPodcast.__init__ Episode\n\n\nPrefect\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the co-founders of Meroxa about their work to build a self service data integration platform that is easy to implement and scale for end users.","date_published":"2020-10-05T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/992ed94f-12b6-4b46-90ed-67ab6080cd17.mp3","mime_type":"audio/mpeg","size_in_bytes":49110352,"duration_in_seconds":3655}]},{"id":"podlove-2020-09-29t01:57:22+00:00-5f8ee858c1d338d","title":"Speed Up And Simplify Your Streaming Data Workloads With Red Panda","url":"https://www.dataengineeringpodcast.com/vectorized-red-panda-streaming-data-episode-152","content_text":"Summary\nKafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. 
Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nIf you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allow users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance optimized, drop-in replacement for Kafka\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Red Panda is and what motivated you to create it?\nWhat are the limitations of Kafka that make something like Red Panda necessary?\nWhat are the current strengths of the Kafka ecosystem that make it a reasonable implementation target for Red Panda?\nHow is Red Panda architected?\n\nHow has the design or direction changed or evolved since you first began working on it?\n\n\nWhat are the challenges that you face in automatically optimizing the runtime to take advantage of the hardware that it is deployed on?\n\nHow do cloud environments contribute to that complexity?\n\n\nHow are you handling the compatibility layer for the Kafka API?\n\nWhat is your approach for managing versioning and ensuring that you maintain bug compatibility?\n\n\nBeyond performance, what other areas of innovation or improvement in the capabilities and experience do you see while adhering to the Kafka protocol?\nWhat are the opportunities for innovation in the streaming space that aren’t being explored yet?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Redpanda being used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building Red Panda and Vectorized?\nWhen is Red Panda the wrong choice?\nWhat do you have planned for the future of the product and business?\nWhat is your Hack The Planet diversity scholarship?\n\nContact Info\n\n@emaxerrno on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or 
technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\n\nVectorized\n\n\nFree Download Trial\n\n\n@vectorizedio Company Twitter Accn’t\n\n\nCommunity Slack\n\n\n\n\nConcord alternative to Flink\n\n\nApache Flink\n\nPodcast Episode\n\n\n\nFAANG == Facebook, Apple, Amazon, Netflix, and Google\n\n\nBlackblaze\n\n\nRaft\n\n\nNATS\n\n\nPulsar\n\nPodcast Episode\nStreamNative Podcast Episode\n\n\n\nOpen Messaging Specification\n\n\nScyllaDB\n\n\nCockroachDB\n\n\nMemSQL\n\n\nWASM == Web Assembly\n\n\nDebezium\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or in scaling an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka that replicates its numerous APIs while scaling more easily and delivering consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Vectorized founder Alexander Gallego about the Red Panda streaming engine and building a drop-in replacement for Kafka with better performance and throughput.","date_published":"2020-09-28T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/381a1b99-dc2c-4a47-adfa-af836807c7d4.mp3","mime_type":"audio/mpeg","size_in_bytes":44901638,"duration_in_seconds":3580}]},{"id":"podlove-2020-09-20t13:20:48+00:00-f574ef2adb0016c","title":"Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor","url":"https://www.dataengineeringpodcast.com/pipeline-data-engineering-academy-episode-151","content_text":"Summary\nData engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. 
Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Daniel Molnar about being a data janitor and how to cut through the hype to understand what to learn for the long run\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing your thoughts on the current state of the data management industry?\nWhat is your strategy for being effective in the face of so much complexity and conflicting needs for data?\nWhat are some of the common difficulties that you see data engineers contend with, whether technical or social/organizational?\nWhat are the core fundamentals that you think are necessary for data engineers to be effective?\nWhat are the gaps in knowledge or experience that you have seen data engineers contend with?\nYou recently started down the path of building a bootcamp for training data engineers. What was your motivation for embarking on that journey?\n\nHow would you characterize your particular approach?\n\n\nWhat are some of the reasons that your applicants have for wanting to become versed in data engineering?\nWhat is the baseline of capabilities that you expect of your target audience?\nWhat level of proficiency do you aim for when someone has completed your training program?\nWho do you think would not be a good fit for your academy?\nAs a hiring manager, what are the core capabilities that you look for in a data engineering candidate?\n\nWhat are some of the methods that you use to assess competence?\n\n\nWhat are the overall trends in the data management space that you are worried by?\n\nWhich ones are you happy about?\n\n\nWhat are your plans and overall goals for the pipeline academy?\n\nContact Info\n\nLinkedIn\n@soobrosa on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nPipeline Data Engineering Academy\nData Janitor 101\nThe Data Janitor Returns\nBerlin, Germany\nHungary\nUrchin google analytics precursor\nAWS Redshift\nNassim Nicholas Taleb\n\nBlack Swans (affiliate link)\n\n\nKISS == Keep It Simple Stupid\nDan McKinley\nRalph Kimball Data Warehousing design\nFalsehoods Programmers Believe\nApache Kafka\nAWS Kinesis\nETL/ELT\nCI/CD\nTelemetry\nDêpeche Mode\nDesigning Data Intensive Applications (affiliate link)\nStop Hiring DevOps Engineers and Start Growing Them\nT Shaped Engineer\nPipeline Data Engineering Academy Curriculum\nMPP == Massively Parallel Processing\nApache Flink\n\nPodcast Episode\n\n\nFlask web framework\nYAGNI == You Ain’t Gonna Need It\nPair Programming\nClojure\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"In this episode Daniel Molnar shares his experiences as a data janitor and the foundational elements of data engineering, as well as his work to build a practical bootcamp for data engineers.","date_published":"2020-09-21T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d685db72-b04f-47fc-bd1d-b58fb233d98e.mp3","mime_type":"audio/mpeg","size_in_bytes":35426142,"duration_in_seconds":2860}]},{"id":"podlove-2020-09-15t00:37:57+00:00-75d5e466dd28dbd","title":"Distributed In Memory Processing And Streaming With Hazelcast","url":"https://www.dataengineeringpodcast.com/hazelcast-in-memory-processing-episode-150","content_text":"Summary\nIn memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nTree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. 
Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get %50 off your first three months after the trial.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Hazelcast is and its origins?\nWhat are the benefits and tradeoffs of in-memory computation for data-intensive workloads?\nWhat are some of the common use cases for the Hazelcast in memory grid?\nHow is Hazelcast implemented?\n\nHow has the architecture evolved since it was first created?\n\n\nHow is the Jet streaming framework architected?\n\nWhat was the motivation for building it?\nHow do the capabilities of Jet compare to systems such as Flink or Spark Streaming?\n\n\nHow has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?\nHow is the governance of the open source grid and Jet projects handled?\n\nWhat is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?\n\n\nWhat is involved in building an application or workflow on top of Hazelcast?\nWhat are the common patterns for engineers who are building on top of Hazelcast?\nWhat is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?\nWhat are the scaling factors for Hazelcast?\n\nWhat are the edge cases that users should be aware of?\n\n\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?\nWhen is Hazelcast Grid or Jet the wrong choice?\nWhat is in store for the future of Hazelcast?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nHazelCast\nIstanbul\nApache Spark\nOrientDB\nCAP Theorem\nNVMe\nMemristors\nIntel Optane Persistent Memory\nHazelcast Jet\nKappa Architecture\nIBM Cloud Paks\nDigital Integration Hub (Gartner)\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Hazelcast in memory processing grid and the Jet streaming engine that was built on top of it and the use cases that they unlock.","date_published":"2020-09-14T20:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cf1d5363-cf50-44cd-902c-70ecdb97f4d9.mp3","mime_type":"audio/mpeg","size_in_bytes":35587677,"duration_in_seconds":2647}]},{"id":"podlove-2020-09-07t22:11:51+00:00-1aeb0602d2ff857","title":"Simplify Your Data Architecture With The Presto Distributed SQL Engine","url":"https://www.dataengineeringpodcast.com/presto-distributed-sql-episode-149","content_text":"Summary\nDatabases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. 
Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what Presto is and its origin story?\n\nWhat was the motivation for releasing Presto as open source?\n\n\nFor someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?\n\nWhat are the primary ways that Presto is being used?\n\n\nI interviewed your colleague at Starburst, Kamil 2 years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?\nWhat are some of the deployment and scaling considerations that operators of Presto should be aware of?\nWhat are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?\nWhat are the tradeoffs of using Presto on top of a data lake vs a vertically integrated warehouse solution?\nWhen designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?\nWhen is Presto the wrong choice?\nWhat is in store for the future of the Presto project and community?\n\nContact Info\n\nLinkedIn\n@mtraverso on Twitter\nmartint on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nPresto\nStarburst Data\n\nPodcast Episode\n\n\nHadoop\nHive\nGlue Metastore\nBigQuery\nKinesis\nApache Pinot\nElasticsearch\nORC\nParquet\nAWS Redshift\nAvro\n\nPodcast Episode\n\n\nLZ4\nZstandard\nKafkaSQL\nFlink\n\nPodcast Episode\n\n\nPyTorch\n\nPodcast.__init__ Episode\n\n\nTensorflow\nSpark\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Martin Traverso about how the Presto distributed SQL engine reduces complexity and increases flexibility for analytical workloads across heterogeneous data sources at scale.","date_published":"2020-09-07T18:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/cc4007e3-73a8-4694-8ec7-b51695c5815a.mp3","mime_type":"audio/mpeg","size_in_bytes":43026977,"duration_in_seconds":3239}]},{"id":"podlove-2020-08-31t10:50:23+00:00-50ee47aa78a9284","title":"Building A Better Data Warehouse For The Cloud At Firebolt","url":"https://www.dataengineeringpodcast.com/firebolt-cloud-data-warehouse-episode-148","content_text":"Summary\nData warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.\nGo to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi-structured data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Firebolt is and your motivation for building it?\nHow does Firebolt compare to other data warehouse technologies what unique features does it provide?\nThe lines between a data warehouse and a data lake have been blurring in recent years. Where on that continuum does Firebolt lie?\nWhat are the unique use cases that Firebolt allows for?\nHow do the performance characteristics of Firebolt change the ways that an engineer should think about data modeling?\nWhat technologies might someone replace with Firebolt?\nHow is Firebolt architected and how has the design evolved since you first began working on it?\nWhat are some of the most challenging aspects of building a data warehouse platform that is optimized for speed?\nHow do you handle support for nested and semi-structured data?\nIn what ways have you found it necessary/useful to extend SQL?\nDue to the immutability of object storage, for data lakes the update or delete process involves reprocessing a potentially large amount of data. How do you approach that in Firebolt with your F3 format?\nWhat have you found to be the most interesting, unexpected, or challenging lessons while building and scaling the Firebolt platform and business?\nWhen is Firebolt the wrong choice?\nWhat do you have planned for the future of Firebolt?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nFirebolt\nSisense\nSnowflakeDB\n\nPodcast Episode\n\n\nRedshift\nSpark\n\nPodcast Episode\n\n\nParquet\n\nPodcast Episode\n\n\nHadoop\nHDFS\nS3\nAWS Athena\nBigQuery\nData Vault\n\nPodcast Episode\n\n\nStar Schema\nDimensional Modeling\nSlowly Changing Dimensions\nJDBC\nTPC Benchmarks\nDBT\n\nPodcast Episode\n\n\nTableau\nLooker\n\nPodcast Episode\n\n\nPrestoSQL\n\nPodcast Episode\n\n\nPostgreSQL\n\nPodcast Episode\n\n\nFoundationDB\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how Firebolt is building a cloud data warehouse optimized for speed and cost effectiveness to unlock your data's full potential.","date_published":"2020-08-31T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/82c780a1-2855-4624-a1c2-883b33c29de5.mp3","mime_type":"audio/mpeg","size_in_bytes":39747773,"duration_in_seconds":3950}]},{"id":"podlove-2020-08-25t00:22:25+00:00-53d3944939a69e0","title":"Metadata Management And Integration At LinkedIn With DataHub","url":"https://www.dataengineeringpodcast.com/datahub-metadata-management-episode-147","content_text":"Summary\nIn order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nIf you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what DataHub is and some of its back story?\n\nWhat were you using at LinkedIn for metadata management prior to the introduction of DataHub?\nWhat was lacking in the previous solutions that motivated you to create a new platform?\n\n\nThere are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?\nWho is the target audience for DataHub?\n\nHow do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?\n\n\nCan you describe how DataHub is architected?\n\nHow has it evolved since you first began working on it?\n\n\nWhat was your motivation for releasing DataHub as an open source project?\n\nWhat have been the benefits of that decision?\nWhat are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?\n\n\nWhat is the workflow for populating metadata into DataHub?\nWhat are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?\nHow do you handle discovery of data assets for users of DataHub?\nWhat are the integration and extension points of the platform?\nWhat is involved in deploying and maintaining and instance of the DataHub platform?\nWhat are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?\nWhat are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?\nWhen is DataHub the wrong choice?\nWhat do you have planned for the future of the project?\n\nContact Info\n\nMars\n\nLinkedIn\nmars-lan on GitHub\n\n\nPardhu\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDataHub\nMap/Reduce\nApache Flume\nLinkedIn Blog Post introducing DataHub\nWhereHows\nHive Metastore\nKafka\nCDC == Change Data Capture\n\nPodcast Episode\n\n\nPDL LinkedIn language\nGraphQL\nElasticsearch\nNeo4J\nApache Pinot\nApache Gobblin\nApache Samza\nOpen Sourcing DataHub Blog Post\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how LinkedIn designed their metadata management platform and how it is being used to power data discovery and integration.","date_published":"2020-08-24T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/dd6c95ec-f56a-42d0-bac2-8435fe942272.mp3","mime_type":"audio/mpeg","size_in_bytes":32072988,"duration_in_seconds":3064}]},{"id":"podlove-2020-08-17t02:55:18+00:00-846a4d33b0e75a8","title":"Exploring The TileDB Universal Data Engine","url":"https://www.dataengineeringpodcast.com/tiledb-universal-data-engine-episode-146","content_text":"Summary\nMost databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shifts in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB cloud to embed the authorization into the storage engine, while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how that enables a more flexible way to store and interact with data to power better data sharing and new opportunities for blending specialized domains.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.\nGo to dataengineeringpodcast.com/datadog today to start your free 14 day trial. 
If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what TileDB is and the problem that you are trying to solve with it?\n\nWhat was your motivation for building it?\n\n\nWhat are the main use cases or problem domains that you are trying to solve for?\n\nWhat are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?\n\n\nWhat are the benefits of using matrices for data processing and domain modeling?\n\nWhat are the challenges that you have faced in storing and processing sparse matrices efficiently?\nHow does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?\n\n\nWhat are the benefits of unbundling the storage engine from the processing layer\nCan you describe how TileDB embedded is architected?\n\nHow has the design evolved since you first began working on it?\n\n\nWhat is your approach to integrating with the broader ecosystem of data storage and processing utilities?\nWhat does the workflow look like for someone using TileDB?\nWhat is required to deploy TileDB in a production context?\nHow is the built in data versioning implemented?\n\nWhat is the user experience for interacting with different versions of datasets?\nHow do you manage the lifecycle of versioned data to allow garbage collection?\n\n\nHow are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?\nWhat are the most interesting, unexpected, or innovative ways that you have seen TileDB used?\nWhat have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?\nWhat features or capabilities are you consciously deciding not to implement?\nWhen is TileDB the wrong choice?\nWhat do you have planned for the future of TileDB?\n\nContact Info\n\nLinkedIn\nstavrospapadopoulos on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTileDB\n\nGitHub\n\n\nData Frames\nTileDB Cloud\nMIT\nIntel\nSparse Linear Algebra\nSparse Matrices\nHDF5\nDask\nSpark\nMariaDB\nPrestoDB\nGDAL\nPDAL\nTuring Complete\nClustered Index\nParquet File Format\n\nPodcast Episode\n\n\nSerializability\nDelta Lake\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shifts in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB cloud to embed the authorization into the storage engine, while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how that enables a more flexible way to store and interact with data to power better data sharing and new opportunities for blending specialized domains.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the creator of TileDB about building a universal data engine to support cross-domain collaboration and reduce the burden of data management.","date_published":"2020-08-17T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2adf504f-1839-4457-93de-8037fe3d52e6.mp3","mime_type":"audio/mpeg","size_in_bytes":48677009,"duration_in_seconds":3944}]},{"id":"podlove-2020-08-10t21:38:40+00:00-596074e44dd7f1b","title":"Closing The Loop On Event Data Collection With Iteratively","url":"https://www.dataengineeringpodcast.com/iteratively-event-data-collection-episode-145","content_text":"Summary\nEvent based data is a rich source of information for analytics, unless none of the event structures are consistent. The team at Iteratively are building a platform to manage the end to end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, lack of clarity on what attributes are needed, and how it is being used then this is definitely a conversation worth following.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nIf you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. 
Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Iteratively and your motivation for creating it?\nWhat are some of the ways that you have seen inconsistent message structures cause problems?\nWhat are some of the common anti-patterns that you have seen for managing the structure of event messages?\nWhat are the benefits that Iteratively provides for the different roles in an organization?\nCan you describe the workflow for a team using Iteratively?\nHow is the Iteratively platform architected?\n\nHow has the design changed or evolved since you first began working on it?\n\n\nWhat are the difficulties that you have faced in building integrations for the Iteratively workflow?\nHow is schema evolution handled throughout the lifecycle of an event?\nWhat are the challenges that engineers face in building effective integration tests for their event schemas?\nWhat has been your biggest challenge in messaging for your platform and educating potential users of its benefits?\nWhat are some of the most interesting or unexpected ways that you have seen Iteratively used?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while building Iteratively?\nWhen is Iteratively the wrong choice?\nWhat do you have planned for the future of Iteratively?\n\nContact Info\n\nPatrick\n\nLinkedIn\n@Patrickt010 on Twitter\nWebsite\n\n\nOndrej\n\nLinkedIn\n@ondrej421 on Twitter\nondrej on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nIteratively\nSyncplicity\nLocally Optimistic\nDBT\n\nPodcast Episode\n\n\nSnowplow Analytics\n\nPodcast Episode\n\n\nJSON Schema\nMaster Data Management\n\nPodcast Episode\n\n\nSDLC == Software Development Life Cycle\nAmplitude\nMixpanel\nMode Analytics\nCRUD == Create, Read, Update, Delete\nSegment\n\nPodcast Episode\n\n\nSchemaver (JSON Schema Versioning Strategy)\nGreat Expectations\n\nPodcast.init Interview\nData Engineering Podcast Interview\n\n\nConfluence\nNotion\nConfluent Schema Registry\n\nPodcast Episode\n\n\nSnowplow Iglu Schema Registry\nPulsar Schema Registry\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Event based data is a rich source of information for analytics, unless none of the event structures are consistent. The team at Iteratively are building a platform to manage the end to end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, lack of clarity on what attributes are needed, and how it is being used, then this is definitely a conversation worth following.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the co-founders of Iteratively about building a platform that solves the collaboration gap for event data collection.","date_published":"2020-08-10T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9c72e917-82f0-48ce-a212-3e0e84845bb4.mp3","mime_type":"audio/mpeg","size_in_bytes":71153287,"duration_in_seconds":3557}]},{"id":"podlove-2020-08-04t02:20:06+00:00-0b95a83b3881470","title":"A Practical Introduction To Graph Data Applications","url":"https://www.dataengineeringpodcast.com/practical-graph-data-episode-144","content_text":"Summary\nFinding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.\nGo to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Denise Gosnell and Matthias Broecheler about the recently published practitioner’s guide to graph data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what your goals are for the Practitioner’s Guide To Graph Data?\n\nWhat was your motivation for writing a book to address this topic?\n\n\nWhat do you see as the driving force behind the growing popularity of graph technologies in recent years?\nWhat are some of the common use cases/applications of graph data and graph traversal algorithms?\n\nWhat are the core elements of graph thinking that data teams need to be aware of to be effective in identifying those cases in their existing systems?\n\n\nWhat are the fundamental principles of graph technologies that data engineers should be familiar with?\n\nWhat are the core modeling principles that they need to know for designing schemas in a graph database?\n\n\nBeyond databases, what are some of the other components of the data stack that can or should handle graphs natively?\nDo you typically use a graph database as the primary or complementary data store?\nWhat are some of the common challenges that you see when bringing graph applications to production?\nWhat have you found to be some of the common points of confusion or error prone aspects of implementing and maintaining graph oriented applications?\nWhen it comes to the specific technologies of different graph databases, what are some of the edge cases/variances in the interfaces or modeling capabilities that they present?\n\nHow does the variation in query languages impact the overall adoption of these technologies?\n\nWhat are your thoughts on the recent standardization of GSQL as an ANSI specification?\n\n\n\n\nWhat are some of the scaling challenges that exist for graph data engines?\nWhat are the ongoing developments/improvements/trends in graph technology that you are most excited about?\n\nWhat are some of the shortcomings in existing technology/ecosystem for graph applications that you would like to see addressed?\n\n\nWhat are some of the cases where a graph is the wrong abstraction for a data project?\nWhat are some of the other resources that you recommend for anyone who wants to learn more about the various aspects of graph data?\n\nContact Info\n\nDenise\n\nLinkedIn\n@DeniseKGosnell on Twitter\n\n\nMatthias\n\nLinkedIn\n@MBroecheler on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nThe Practitioner’s Guide To Graph Data\nDatastax\nTitan graph database\nGoethe\nGraph Database\nNoSQL\nRelational Database\nElasticsearch\n\nPodcast Episode\n\n\nAssociative Array Data Structure\nRDF Triple\nDatastax Multi-model Graph Database\nSemantic Web\nGremlin Graph Query Language\nSuper Node\nNeuromorphic Computing\nDatastax Desktop\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the authors of the Practitioner's Guide To Graph Data about how, when, and why to use graph data algorithms and data structures.","date_published":"2020-08-03T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9fe771b2-19c4-443f-95c9-82ea3765fc18.mp3","mime_type":"audio/mpeg","size_in_bytes":72864307,"duration_in_seconds":3643}]},{"id":"podlove-2020-07-28t00:07:42+00:00-d9ec3cb4d91e21e","title":"Build More Reliable Distributed Systems By Breaking Them With Jepsen","url":"https://www.dataengineeringpodcast.com/jepsen-distributed-systems-testing-episode-143","content_text":"Summary\nA majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nIf you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what the Jepsen project is?\n\nWhat was your inspiration for starting the project?\n\n\nWhat other methods are available for evaluating and stress testing distributed systems?\nWhat are some of the common misconceptions or misunderstanding of distributed systems guarantees and how they impact real world usage of things like databases?\nHow do you approach the design of a test suite for a new distributed system?\n\nWhat is your heuristic for determining the completeness of your test suite?\n\n\nWhat are some of the common challenges of setting up a representative deployment for testing?\nCan you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?\nHow is Jepsen implemented?\n\nHow has the design evolved since you first began working on it?\nWhat are the pros and cons of using Clojure for building Jepsen?\nIf you were to start over today on the Jepsen framework what would you do differently?\n\n\nWhat are some of the most common failure modes that you have identified in the platforms that you have tested?\nWhat have you found to be the most difficult to resolve distributed systems bugs?\nWhat are some of the interesting developments in distributed systems design that you are keeping an eye on?\nHow do you perceive the impact that Jepsen has had on modern distributed systems products?\nWhat have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?\nWhat do you have planned for the future of the Jepsen framework?\n\nContact Info\n\naphyr on GitHub\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nJepsen\nRiak\nDistributed Systems\nTLA+\nCoq\nIsabelle\nCassandra DTest\nFoundationDB\n\nPodcast Episode\n\n\nCRDT == Conflict-free Replicated Data-type\n\nPodcast Episode\n\n\nRiemann\nClojure\nJVM == Java Virtual Machine\nKotlin\nHaskell\nScala\nGroovy\nTiDB\nYugabyteDB\n\nPodcast Episode\n\n\nCockroachDB\n\nPodcast Episode\n\n\nRaft consensus algorithm\nPaxos\nLeslie Lamport\nCalvin\nFaunaDB\n\nPodcast Episode\n\n\nHeidi Howard\nCALM Conjecture\nCausal Consistency\nHillel Wayne\nChristopher Meiklejohn\nDistsys Class\nDistributed Systems For Fun And Profit by\nMikito Takada\nChristopher Meiklejohn Reading List\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Jepsen creator Kyle Kingsbury about what he has learned about distributed systems by breaking them.","date_published":"2020-07-27T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c81bc86a-797f-4499-97a9-284a54b382ca.mp3","mime_type":"audio/mpeg","size_in_bytes":41846606,"duration_in_seconds":2978}]},{"id":"podlove-2020-07-21t01:22:00+00:00-0b5d3a7c2e47cee","title":"Making Wind Energy More Efficient With Data At Turbit Systems","url":"https://www.dataengineeringpodcast.com/turbit-systems-wind-energy-episode-142","content_text":"Summary\nWind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.\nGo to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. 
Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Michael Tegtmeier about Turbit, a machine learning powered platform for performance monitoring of wind farms\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Turbit and your motivation for creating the business?\nWhat are the most problematic factors that contribute to low performance in power generation with wind turbines?\nWhat is the current state of the art for accessing and analyzing data for wind farms?\nWhat information are you able to gather from the SCADA systems in the turbine?\n\nHow uniform is the availability and formatting of data from different manufacturers?\n\n\nHow are you handling data collection for the individual turbines?\n\nHow much information are you processing at the point of collection vs. sending to a centralized data store?\n\n\nCan you describe the system architecture of Turbit and the lifecycle of turbine data as it propagates from collection to analysis?\nHow do you incorporate domain knowledge into the identification of useful data and how it is used in the resultant models?\nWhat are some of the most challenging aspects of building an analytics product for the wind energy sector?\nWhat have you found to be the most interesting, unexpected, or challenging aspects of building and growing Turbit?\nWhat do you have planned for the future of the technology and business?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nTurbit Systems\nLIDAR\nPulse Shaping\nWind Turbine\nSCADA\nGenetic Algorithm\nBremen Germany\nPitch\nYaw\nNacelle\nAnemometer\nNeural Network\nSwarm64\n\nPodcast Episode\n\n\nTensorflow\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the founder of Turbit Systems about how they are improving the efficiency and sustainability of wind energy through real time analysis of data collected from turbines.","date_published":"2020-07-20T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ac807b69-68a3-4141-8ec9-3cc13c001d1b.mp3","mime_type":"audio/mpeg","size_in_bytes":36784702,"duration_in_seconds":2448}]},{"id":"podlove-2020-07-12t23:02:16+00:00-53c513da92be849","title":"Open Source Production Grade Data Integration With Meltano","url":"https://www.dataengineeringpodcast.com/meltano-data-integration-episode-141","content_text":"Summary\nThe first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their Data and Meltano is building the reamining structure to make it a fully featured contender for proprietary systems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.\nGo to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Meltano is and the story behind it?\nWho is the target audience?\n\nHow does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?\n\n\nWhat have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?\n\nWhat are the most painful transitions in that lifecycle and how does that pain manifest?\n\n\nHow and why has the focus of the project shifted from its original vision?\nWith your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?\n\nWhat are the main elements of your strategy to address these barriers?\n\n\nHow is the Meltano platform in its current incarnation implemented?\n\nHow much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?\n\n\nWhat have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?\nWhat are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?\nWhen is Meltano the wrong choice?\nWhat is your broad vision for the future of Meltano?\n\nWhat are the most immediate needs for contribution that will help you realize that vision?\n\n\n\nContact Info\n\nWebsite\nDouweM on GitLab\nDouweM on GitHub\n@DouweM on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMeltano\nGitLab\nMexico City\nNetherlands\nLocally Optimistic\nSinger\nStitch Data\nDBT\nELT\nInformatica\nVersion Control\nCode Review\nCI/CD\nJupyter Notebook\nLookML\nMeltano Modeling Syntax\nRedash\nMetabase\nApache Superset\nApache Airflow\nLuigi\nPrefect\nDagster\nTransferwise\nPipelinewise\n12 Factor Application\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Meltano project and their goal of building a fully open source data integration platform that is competitive with commercial systems.","date_published":"2020-07-13T13:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f3fd8161-35ea-4e08-ad2a-525ecdb10eee.mp3","mime_type":"audio/mpeg","size_in_bytes":42714203,"duration_in_seconds":3919}]},{"id":"podlove-2020-07-06t16:31:12+00:00-44ae2ce28021007","title":"DataOps For Streaming Systems With Lenses.io","url":"https://www.dataengineeringpodcast.com/lenses-streaming-dataops-episode-140","content_text":"Summary\nThere are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nToday’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Lenses is and the story behind it?\nWhat is your working definition for what constitutes DataOps?\n\nHow does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?\n\nWhat are the typical barriers to collaboration, and how does Lenses help with that?\n\n\n\n\nMany different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?\nWhat are the main challenges that you see engineers facing when working with streaming systems?\nWhat have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?\nOne of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?\nOn the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?\n\nWhat are some useful strategies for collecting and analyzing traces of data flows?\n\n\nAs with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?\n\nHow do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?\n\n\nHow is the Lenses platform implemented and how has its design evolved since you first began working on it?\nWhat are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?\nWhen is Lenses the wrong choice?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\n@StevensonA_D on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nLenses.io\nBabylon Health\nDevOps\nDataOps\nGitOps\nApache Calcite\nkSQL\nKafka Connect Query Language\nApache Flink\n\nPodcast Episode\n\n\nApache Spark\n\nPodcast Episode\n\n\nApache Pulsar\n\nPodcast Episode\nStreamNative Episode\n\n\nPlaytika\nRiskfuel(?)\nJMX Metrics\nAmazon MSK (Managed Streaming for Kafka)\nPrometheus\nCanary Deployment\nKafka on Pulsar\nData Catalog\nData Mesh\n\nPodcast Episode\n\n\nDagster\nAirflow\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data require a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Lenses.io platform addresses the DataOps challenges for streaming systems to power observability, discovery, and governance of your real time data.","date_published":"2020-07-06T13:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e74fed0f-e240-44f3-b2f4-acb32d6309da.mp3","mime_type":"audio/mpeg","size_in_bytes":36691311,"duration_in_seconds":2736}]},{"id":"podlove-2020-06-29t12:09:04+00:00-8c66466f3a18d25","title":"Data Collection And Management To Power Sound Recognition At Audio Analytic","url":"https://www.dataengineeringpodcast.com/audio-analytic-sound-recognition-episode-139","content_text":"Summary\nWe have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they are faced with in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!\nYour host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. 
Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Audio Analytic?\n\nWhat was your motivation for building an AI platform for sound recognition?\n\n\nWhat are some of the ways that your platform is being used?\nWhat are the unique challenges that you have faced in working with arbitrary sound data?\nHow do you handle the collection and labelling of the source data that you rely on for building your models?\n\nBeyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?\nHow has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?\n\n\nchallenges of building an embeddable AI model\n\nupdate cycle\n\n\ndifficulty of identifying relevant audio and dealing with literal noise in the input data\nrights and ownership challenges in collection of source data\nWhat was your design process for constructing a pipeline for the audio data that you need to process?\nCan you describe how your overall data management system is architected?\n\nHow has that architecture evolved since you first began building and using it?\n\n\nA majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?\nWhat are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?\nHow do you address variability in the duration of source samples in the processing pipeline?\nHow much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?\nWhat are the limitations of the model in dealng with complex and layered audio environments?\n\nHow has the testing and evaluation of your model fed back into your strategies for collecting source data?\n\n\nWhat are some of the weirdest or most unusual sounds that you have worked with?\nWhat have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?\nWhat do you have planned for the future of the company?\n\nContact Info\n\nChris\n\nLinkedIn\n\n\nThomas\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nAudio Analytic\n\nTwitter\n\n\nAnechoic Chamber\nEXIF Data\nID3 Tags\nPolyphonic Sound Detection Score\n\nGitHub Repository\n\n\nICASSP\nCES\nMO+ ARM Processor\nContext Systems Blog Post\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they face in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how Audio Analytic is building a data set of high quality audio samples from scratch to power their sound recognition technology.","date_published":"2020-06-29T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/816ffa25-b257-45a1-8835-d05333e91f3d.mp3","mime_type":"audio/mpeg","size_in_bytes":36560806,"duration_in_seconds":3448}]},{"id":"podlove-2020-06-23t01:11:28+00:00-aba1edfe74099e2","title":"Bringing Business Analytics To End Users With GoodData","url":"https://www.dataengineeringpodcast.com/gooddata-business-analytics-episode-138","content_text":"Summary\nThe majority of analytics platforms are focused on use internal to an organization by business stakeholders. As the availability of data increases and overall literacy in how to interpret it and take action improves there is a growing need to bring business intelligence use cases to a broader audience. GoodData is a platform focused on simplifying the work of bringing data to employees and end users. In this episode Sheila Jung and Philip Farr discuss how the GoodData platform is being used, how it is architected to provide scalable and performant analytics, and how it integrates into customer’s data platforms. This was an interesting conversation about a different approach to business intelligence and the importance of expanded access to data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nGoodData is revolutionizing the way in which companies provide analytics to their customers and partners. Start now with GoodData Free that makes our self-service analytics platform available to you at no cost. Register today at dataengineeringpodcast.com/gooddata\nYour host is Tobias Macey and today I’m interviewing Sheila Jung and Philip Farr about how GoodData is building a platform that lets you share your analytics outside the boundaries of your organization\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at GoodData and some of its origin story?\nThe business intelligence market has been around for decades now and there are dozens of options with different areas of focus. 
What are the factors that might motivate me to choose GoodData over the other contenders in the space?\nWhat are the use cases and industries that you focus on supporting with GoodData?\nHow has the market of business intelligence tools evolved in recent years?\n\nWhat are the contributing trends in technology and business use cases that are driving that change?\n\n\nWhat are some of the ways that your customers are embedding analytics into their own products?\nWhat are the differences in processing and serving capabilities between an internally used business intelligence tool, and one that is used for embedding into externally used systems?\n\nWhat unique challenges are posed by the embedded analytics use case?\nHow do you approach topics such as security, access control, and latency in a multitenant analytics platform?\n\n\nWhat guidelines have you found to be most useful when addressing the concerns of accuracy and interpretability of the data being presented?\nHow is the GoodData platform architected?\n\nWhat are the complexities that you have had to design around in order to provide performant access to your customers’ data sources in an interactive use case?\nWhat are the off-the-shelf components that you have been able to integrate into the platform, and what are the driving factors for solutions that have been built specifically for the GoodData use case?\n\n\nWhat is the process for your users to integrate GoodData into their existing data platform?\nWhat is the workflow for someone building a data product in GoodData?\nHow does GoodData manage the lifecycle of the data that your customers are presenting to their end users?\nHow does GoodData integrate into the customer development lifecycle?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with GoodData?\nCan you give an overview of the MAQL (Multi-Dimension Analytical Query Language) dialect that you use in GoodData and contrast it with SQL?\n\nWhat are the benefits and additional functionality that MAQL provides?\n\n\nWhen is GoodData the wrong choice?\nWhat is on the roadmap for the future of GoodData?\n\nContact Info\n\nSheila\n\nLinkedIn\n\n\nPhilip\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nGoodData\nTeradata\nReactJS\nSnowflakeDB\n\nPodcast Episode\n\n\nRedshift\nBigQuery\nSOC2\nHIPAA\nGDPR == General Data Protection Regulation\nIoT == Internet of Things\nSAML\nRuby\nMulti-Dimension Analytical Query Language\nKubernetes\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary
The majority of analytics platforms are focused on internal use by business stakeholders within an organization. As the availability of data increases and overall literacy in how to interpret it and take action improves, there is a growing need to bring business intelligence use cases to a broader audience. GoodData is a platform focused on simplifying the work of bringing data to employees and end users. In this episode Sheila Jung and Philip Farr discuss how the GoodData platform is being used, how it is architected to provide scalable and performant analytics, and how it integrates into customers’ data platforms. This was an interesting conversation about a different approach to business intelligence and the importance of expanded access to data.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the GoodData platform lets you bring business analytics to your customers and end users.","date_published":"2020-06-22T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0921a207-65ee-46bf-8444-7b0de976d4f8.mp3","mime_type":"audio/mpeg","size_in_bytes":41331453,"duration_in_seconds":3144}]},{"id":"podlove-2020-06-15t12:37:40+00:00-b380c2a5ca8e22e","title":"Accelerate Your Machine Learning With The StreamSQL Feature Store","url":"https://www.dataengineeringpodcast.com/streamsql-machine-learning-feature-store-episode-137","content_text":"Summary\nMachine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a recent trend toward building a dedicated feature. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. 
And don’t forget to thank them for their continued support of this show!\nYour host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL\n\nInterview\n\nIntroduction\nHow did you get involved in the areas of machine learning and data management?\nWhat is StreamSQL and what motivated you to start the business?\nCan you describe what a machine learning feature is?\nWhat is the difference between generating features for training a model and generating features for serving?\nHow is feature management typically handled today?\nWhat is a feature store and how is it different from the status quo?\nWhat is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?\nHow does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?\nWhat are the general requirements of a feature store?\nWhat additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?\n\nHow is discovery and documentation of features handled?\n\n\nWhat is the current landscape of feature stores and how does StreamSQL compare?\nHow is the StreamSQL feature store implemented?\n\nHow is the supporting infrastructure architected and how has it evolved since you first began working on it?\n\n\nWhy is streaming data such a focal point of feature stores?\nHow do you generate features for training?\nHow do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?\nHow do you handle versioning and deploying features?\nWhat’s the process for integrating data sources into StreamSQL for processing into features?\nHow are the features materialized?\nWhat are the most challenging or complex aspects of working on or with a feature store?\nWhen is StreamSQL the wrong choice for a feature store?\nWhat are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?\nWhat do you have planned for the future of the product?\n\nContact Info\n\nLinkedIn\n@simba_khadder on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nStreamSQL\nFeature Stores for ML\nDistributed Systems\nGoogle Cloud Datastore\nTriton\nUber Michelangelo\nAirBnB Zipline\nLyft Dryft\nApache Flink\n\nPodcast Episode\n\n\nApache Kafka\nSpark Streaming\nApache Cassandra\nRedis\nApache Pulsar\n\nPodcast Episode\nStreamNative Episode\n\n\nTDD == Test Driven Development\nLyft presentation – Bootstrapping Flink\nGo-Jek Feast\nHopsworks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary
Machine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.
Announcements
Interview
Contact Info
Parting Question
Closing Announcements
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with the creator of StreamSQL on the complexities of building a feature store and the benefits that they provide to the development and delivery of machine learning models.","date_published":"2020-06-15T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/eb0791b9-5d8a-4946-a458-1f7b3e209923.mp3","mime_type":"audio/mpeg","size_in_bytes":39046550,"duration_in_seconds":2772}]},{"id":"podlove-2020-06-08t11:26:24+00:00-827c9331ec0e9f0","title":"Data Management Trends From An Investor Perspective","url":"https://www.dataengineeringpodcast.com/redpoint-ventures-data-management-trends-episode-136","content_text":"Summary\nThe landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nYou listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. 
The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.\nYour host is Tobias Macey and today I’m interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of Redpoint Ventures and your role there?\nFrom an investor perspective, what is most appealing about the category of data-oriented businesses?\nWhat are the main sources of information that you rely on to keep up to date with what is happening in the data industry?\n\nWhat is your personal heuristic for determining the relevance of any given piece of information to decide whether it is worthy of further investigation?\n\n\nAs someone who works closely with a variety of companies across different industry verticals and different areas of focus, what are some of the common trends that you have identified in the data ecosystem?\nIn your article that covers the trends you are keeping an eye on for 2020 you call out 4 in particular, data quality, data catalogs, observability of what influences critical business indicators, and streaming data. Taking those in turn:\n\nWhat are the driving factors that influence data quality, and what elements of that problem space are being addressed by the companies you are watching?\n\nWhat are the unsolved areas that you see as being viable for newcomers?\n\n\nWhat are the challenges faced by businesses in establishing and maintaining data catalogs?\n\nWhat approaches are being taken by the companies who are trying to solve this problem?\n\nWhat shortcomings do you see in the available products?\n\n\n\n\nFor gaining visibility into the forces that impact the key performance indicators (KPI) of businesses, what is lacking in the current approaches?\n\nWhat additional information needs to be tracked to provide the needed context for making informed decisions about what actions to take to improve KPIs?\nWhat challenges do businesses in this observability space face to provide useful access and analysis to this collected data?\n\n\nStreaming is an area that has been growing rapidly over the past few years, with many open source and commercial options. What are the major business opportunities that you see to make streaming more accessible and effective?\n\nWhat are the main factors that you see as driving this growth in the need for access to streaming data?\n\n\n\n\nWith your focus on these trends, how does that influence your investment decisions and where you spend your time?\nWhat are the unaddressed markets or product categories that you see which would be lucrative for new businesses?\nIn most areas of technology now there is a mix of open source and commercial solutions to any given problem, with varying levels of maturity and polish between them. What are your views on the balance of this relationship in the data ecosystem?\n\nFor data in particular, there is a strong potential for vendor lock-in which can cause potential customers to avoid adoption of commercial solutions. 
What has been your experience in that regard with the companies that you work with?\n\n\n\nContact Info\n\n@AstasiaMyers on Twitter\n@astasia on Medium\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nRedpoint Ventures\n4 Data Trends To Watch in 2020\nSeagate\nWestern Digital\nPure Storage\nCisco\nCohesity\nLooker\n\nPodcast Episode\n\n\nDGraph\n\nPodcast Episode\n\n\nDremio\n\nPodcast Episode\n\n\nSnowflakeDB\n\nPodcast Episode\n\n\nThoughspot\nTibco\nElastic\nSplunk\nInformatica\nData Council\nDataCoral\nMattermost\nBitwarden\nSnowplow\n\nPodcast Interview\nInterview About Snowplow Infrastructure\n\n\nCHAOSSEARCH\n\nPodcast Episode\n\n\nKafka Streams\nPulsar\n\nPodcast Interview\nFollowup Podcast Interview\n\n\nSoda\nToro\nGreat Expectations\nAlation\nCollibra\nAmundsen\nDataHub\nNetflix Metacat\nMarquez\n\nPodcast Episode\n\n\nLDAP == Lightweight Directory Access Protocol\nAnodot\nDatabricks\nFlink\n\nPodcast Episode\n\n\nZookeeper\n\nPodcast Episode\n\n\nPravega\n\nPodcast Episode\n\n\nAirtable\nAlteryx\nCockroachDB\n\nPodcast Episode\n\n\nSuperset\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview with Astasia Myers of Redpoint Ventures on the data management industry trends that she is paying attention to as an investor.","date_published":"2020-06-08T07:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a5cbadd0-0439-4812-8ba9-93e770aac4a9.mp3","mime_type":"audio/mpeg","size_in_bytes":51425669,"duration_in_seconds":3298}]},{"id":"podlove-2020-06-01t00:36:05+00:00-30bc586eb21d7ab","title":"Building A Data Lake For The Database Administrator At Upsolver","url":"https://www.dataengineeringpodcast.com/upsolver-data-lake-database-administrator-episode-135","content_text":"Summary\nData lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nYou listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. 
Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.\nYour host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your definition of what a data lake is and what it is comprised of?\nWe talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?\n\nHow has Upsolver changed or evolved since we last spoke?\n\nHow has the evolution of the underlying technologies impacted your implementation and overall product strategy?\n\n\n\n\nWhat are some of the common challenges that accompany a data lake implementation?\nHow do those challenges influence the adoption or viability of a data lake?\nHow does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?\n\nWhat are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?\n\n\nWhat are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?\nHow is the SQL layer in Upsolver implemented?\n\nWhat are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?\n\n\nWhat are the main concepts that you need to educate your customers on?\nWhat are some of the pitfalls that users should be aware of?\nWhat features of your platform are often overlooked or underutilized which you think should be more widely adopted?\nWhat have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?\nWhat do you have planned for the future?\n\nContact Info\n\nOri\n\nLinkedIn\n\n\nYoni\n\nyoniiny on GitHub\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nUpsolver\n\nPodcast Episode\n\n\nDBA == Database Administrator\nIDF == Israel Defense Forces\nData Lake\nEventual Consistency\nApache Spark\nRedshift Spectrum\nAzure Synapse Analytics\nSnowflakeDB\n\nPodcast Episode\n\n\nBigQuery\nPresto\n\nPodcast Episode\n\n\nApache Kafka\nCartesian Product\nkSQLDB\n\nPodcast Episode\n\n\nEventador\n\nPodcast Episode\n\n\nMaterialize\n\nPodcast Episode\n\n\nCommon Table Expressions\nLambda Architecture\nKappa Architecture\nApache Flink\n\nPodcast Episode\n\n\nReinforcement Learning\nCloudformation\nGDPR\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview about Upsolver's mission to build a data lake that empowers the database administrator to step into the world of big data","date_published":"2020-06-01T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f81e3854-ecb8-4fc8-a156-15a1509beacc.mp3","mime_type":"audio/mpeg","size_in_bytes":37228584,"duration_in_seconds":3377}]},{"id":"podlove-2020-05-25t12:21:40+00:00-d57789a352658b8","title":"Mapping The Customer Journey For B2B Companies At Dreamdata","url":"https://www.dataengineeringpodcast.com/dreamdata-b2b-customer-journey-episode-134","content_text":"Summary\nGaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals involved and the myriad ways that they interface with the business. Dreamdata integrates data from the multitude of platforms that are used by these organizations so that they can get a comprehensive view of their customer lifecycle. In this episode Ole Dallerup explains how Dreamdata was started, how their platform is architected, and the challenges inherent to data management in the B2B space. This conversation is a useful look into how data engineering and analytics can have a direct impact on the success of the business.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhat are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. 
It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.\nYour host is Tobias Macey and today I’m interviewing Ole Dallerup about Dreamdata, a platform for simplifying data integration for B2B companies\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you are building at Dreamdata?\n\nWhat was your inspiration for starting a company and what keeps you motivated?\n\n\nHow do the data requirements differ between B2C and B2B companies?\nWhat are the challenges that B2B companies face in gaining visibility across the lifecycle of their customers?\n\nHow does that lack of visibility impact the viability or growth potential of the business?\nWhat are the factors that contribute to silos in visibility of customer activity within a business?\n\n\nWhat are the data sources that you are dealing with to generate meaningful analytics for your customers?\nWhat are some of the challenges that businesses face in either generating or collecting useful information about their customer interactions?\nHow is the technical platform of Dreamdata implemented and how has it evolved since you first began working on it?\nWhat are some of the ways that you approach entity resolution across the different channels and data sources?\nHow do you reconcile the information collected from different sources that might use disparate data formats and representations?\nWhat is the onboarding process for your customers to identify and integrate with all of their systems?\nHow do you approach the definition of the schema model for the database that your customers implement for storing their footprint?\n\nDo you allow for customization by the customer?\nDo you rely on a tool such as DBT for populating the table definitions and transformations from the source data?\n\n\nHow do you approach representation of the analysis and actionable insights to your customers so that they are able to accurately interpret the results?\nHow have your own experiences at Dreamdata influenced the areas that you invest in for the product?\nWhat are some of the most interesting or surprising insights that you have been able to gain as a result of the unified view that you are building?\nWhat are some of the most challenging, interesting, or unexpected lessons that you have learned from building and growing the technical and business elements of Dreamdata?\nWhen might a user be better served by building their own pipelines or analysis for tracking their customer interactions?\nWhat do you have planned for the future of Dreamdata?\nWhat are some of the industry trends that you are keeping an eye on and what potential impacts to your business do you anticipate?\n\nContact Info\n\nLinkedIn\n@oledallerup on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDreamdata\nPoker Tracker\nTrustPilot\nZendesk\nSalesforce\nHubspot\nGoogle BigQuery\nSnowflakeDB\n\nPodcast Episode\n\n\nAWS Redshift\nSinger\nStitch Data\nDataform\n\nPodcast Episode\n\n\nDBT\n\nPodcast Episode\n\n\nSegment\n\nPodcast Episode\n\n\nCloud Dataflow\nApache Beam\nUTM Parameters\nClearbit\nCapterra\nG2 Crowd\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

Gaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals involved and the myriad ways that they interface with the business. Dreamdata integrates data from the multitude of platforms that are used by these organizations so that they can get a comprehensive view of their customer lifecycle. In this episode Ole Dallerup explains how Dreamdata was started, how their platform is architected, and the challenges inherent to data management in the B2B space. This conversation is a useful look into how data engineering and analytics can have a direct impact on the success of the business.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview about the challenges of tracking the customer journey for B2B companies and how Dreamdata is addressing the problem with data integration","date_published":"2020-05-25T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/246e5188-c9a3-419a-87e7-e1ccd85dd7aa.mp3","mime_type":"audio/mpeg","size_in_bytes":25277728,"duration_in_seconds":2819}]},{"id":"podlove-2020-05-17t23:26:20+00:00-ec2ab736735f192","title":"Power Up Your PostgreSQL Analytics With Swarm64","url":"https://www.dataengineeringpodcast.com/swarm64-postgresql-analytics-episode-133","content_text":"Summary\nThe PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. 
Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.\nYour host is Tobias Macey and today I’m interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Swarm64 is?\n\nHow did the business get started and what keeps you motivated?\n\n\nWhat are some of the common bottlenecks that users of postgres run into?\nWhat are the use cases and workloads that gain the most benefit from increased parallelism in the database engine?\nBy increasing the processing throughput of the database, how does that impact disk I/O and what are some options for avoiding bottlenecks in the persistence layer?\nCan you describe how Swarm64 is implemented?\n\nHow has the product evolved since you first began working on it?\n\n\nHow has the evolution of postgres impacted your product direction?\n\nWhat are some of the notable challenges that you have dealt with as a result of upstream changes in postgres?\n\n\nHow has the hardware landscape evolved and how does that affect your prioritization of features and improvements?\nWhat are some of the other extensions in the postgres ecosystem that are most commonly used alongside Swarm64?\n\nWhich extensions conflict with yours and how does that impact potential adoption?\n\n\nIn addition to your work to optimize performance of the postgres engine, you also provide support for using an FPGA as a co-processor. What are the benefits that an FPGA provides over and above a CPU or GPU architecture?\n\nWhat are the available options for provisioning hardware in a datacenter or the cloud that has access to an FPGA?\nMost people are familiar with the relevant attributes for selecting a CPU or GPU, what are the specifications that they should be looking at when selecting an FPGA?\n\n\nFor users who are adopting Swarm64, how does it impact the way they should be thinking of their data models?\nWhat is involved in migrating an existing database to use Swarm64?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the product and business of Swarm64?\nWhen is Swarm64 the wrong choice?\nWhat do you have planned for the future of Swarm64?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSwarm64\nLufthansa Cargo\nIBM Cognos Analytics\nOLAP Cube\nPostgreSQL\nGeospatial Data\nTimescaleDB\n\nPodcast Episode\n\n\nFPGA == Field Programmable Gate Array\nGreenplum\nForeign Data Tables\nPostgreSQL Table Storage API\nEnterpriseDB\nXilinx\nOVH Cloud\nNimbix\nAzure\nTableau\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview with Swarm64 CEO Thomas Richter about optimizing PostgreSQL on high performance hardware and FPGAs for analytical workloads.","date_published":"2020-05-18T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e195fcf3-a3c6-4c73-b4a3-93ae5c3cbd47.mp3","mime_type":"audio/mpeg","size_in_bytes":43378657,"duration_in_seconds":3163}]},{"id":"podlove-2020-05-10t23:43:56+00:00-7bb85a08719deb4","title":"StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar","url":"https://www.dataengineeringpodcast.com/streamnative-pulsar-streaming-data-episode-132","content_text":"Summary\nThere have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real time messaging systems with built in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. 
Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.\nYour host is Tobias Macey and today I’m interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what Pulsar is?\n\nHow did you get involved with the project?\n\n\nWhat is Pulsar’s role in the lifecycle of data and where does it fit in the overall ecosystem of data tools?\nHow has the Pulsar project evolved or changed over the past 2 years?\n\nHow has the overall state of the ecosystem influenced the direction that Pulsar has taken?\n\n\nOne of the critical elements in the success of a piece of technology is the ecosystem that grows around it. How has the community responded to Pulsar, and what are some of the barriers to adoption?\n\nHow are you and other project leaders addressing those barriers?\n\n\nYou were a co-founder at Streamlio, which was built on top of Pulsar, and now you have founded StreamNative to offer Pulsar as a service. What did you learn from your time at Streamlio that has been most helpful in your current endeavor?\n\nHow would you characterize your relationship with the project and community in each role?\n\n\nWhat motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general?\n\nWhy is streaming data such an important capability?\nHow have projects such as Kafka and Pulsar impacted the broader software and data landscape?\n\n\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen Pulsar used?\nWhen is Pulsar the wrong choice?\nWhat do you have planned for the future of StreamNative?\n\nContact Info\n\nLinkedIn\n@sijieg on Twitter\nsijie on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nApache Pulsar\n\nPodcast Episode\n\n\nStreamNative\nStreamlio\nHadoop\nHBase\nHive\nTencent\nYahoo\nBookKeeper\nPublish/Subscribe\nKafka\nZookeeper\n\nPodcast Episode\n\n\nKafka Connect\nPulsar Functions\nPulsar IO\nKafka On Pulsar\n\nWebinar Video\n\n\nPulsar Protocol Handler\nOVH Cloud\nOpen Messaging\nActiveMQ\nKubernetes\nHelm\nPulsar Helm Charts\nGrafana\nBestPay(?)\nLambda Architecture\nEvent Sourcing\nWebAssembly\nApache Flink\n\nPodcast Episode\n\n\nPulsar Summit\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real time messaging systems with built in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview with StreamNative co-founder Sijie Guo about his experience contributing to the Pulsar framework for streaming data and its community.","date_published":"2020-05-11T12:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b0c82102-06ae-4b97-b5a2-289b30b84400.mp3","mime_type":"audio/mpeg","size_in_bytes":46437331,"duration_in_seconds":3319}]},{"id":"podlove-2020-05-03t22:21:43+00:00-8507e1e79c5f613","title":"Enterprise Data Operations And Orchestration At Infoworks","url":"https://www.dataengineeringpodcast.com/infoworks-data-operations-episode-131","content_text":"Summary\nData management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview co-founder and CTO of Infoworks Amar Arsikere explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nFree yourself from maintaining brittle data pipelines that require excessive coding and don’t operationally scale. With the Ascend Unified Data Engineering Platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment — enabling 10x faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30-day trial. 
You’ll partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production.\nYour host is Tobias Macey and today I’m interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you have built at Infoworks and the story of how it got started?\nWhat are the fundamental challenges that often plague organizations dealing with \"big data\"?\n\nHow do those challenges change or compound in the context of an enterprise organization?\nWhat are some of the unique needs that enterprise organizations have of their data?\n\n\nWhat are the design or technical limitations of existing big data technologies that contribute to the overall difficulty of using or integrating them effectively?\nWhat are some of the tools or platforms that InfoWorks replaces in the overall data lifecycle?\n\nHow do you identify and prioritize the integrations that you build?\n\n\nHow is Infoworks itself architected and how has it evolved since you first built it?\nDiscoverability and reuse of data is one of the biggest challenges facing organizations of all sizes. How do you address that in your platform?\nWhat are the roles that use InfoWorks in their day-to-day?\n\nWhat does the workflow look like for each of those roles?\n\n\nCan you talk through the overall lifecycle of a unit of data in InfoWorks and the different subsystems that it interacts with at each stage?\nWhat are some of the design challenges that you face in building a UI oriented workflow while providing the necessary level of control for these systems?\n\nHow do you handle versioning of pipelines and validation of new iterations prior to production release?\nWhat are the cases where the no code, graphical paradigm for data orchestration breaks down?\n\n\nWhat are some of the most challenging, interesting, or unexpected lessons that you have learned since starting Infoworks?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nInfoWorks\nGoogle BigTable\nApache Spark\nApache Hadoop\nZynga\nData Partitioning\nInformatica\nPentaho\nTalend\nApache NiFi\nGoldenGate\nBigQuery\nChange Data Capture\n\nPodcast Episode About Debezium\n\n\nSlowly Changing Dimensions\nSnowflake DB\n\nPodcast Episode\n\n\nTableau\nData Catalog\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview co-founder and CTO of Infoworks Amar Arsikere explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview with Amar Arsikere about the complexities of data operations at enterprise scale and the approach that Infoworks taken to make it manageable.","date_published":"2020-05-04T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/54d890d0-bc23-4ac7-baf5-86f9b7fb76a4.mp3","mime_type":"audio/mpeg","size_in_bytes":30082991,"duration_in_seconds":2753}]},{"id":"podlove-2020-04-28t02:46:55+00:00-7884f3e30845dd5","title":"Taming Complexity In Your Data Driven Organization With DataOps","url":"https://www.dataengineeringpodcast.com/dataops-organizational-complexity-episode-130","content_text":"Summary\nData is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how it needs to be accessed, and what to do with it, many data projects are doomed to failure. In this episode Chris Bergh explains how taking an agile approach to delivering value can drive down the complexity that grows out of the varied needs of the business. Building a DataOps workflow that incorporates fast delivery of well defined projects, continuous testing, and open lines of communication is a proven path to success.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nIf DataOps sounds like the perfect antidote to your pipeline woes, DataKitchen is here to help. DataKitchen’s DataOps Platform automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing and monitoring to development and deployment. In no time, you’ll reclaim control of your data pipelines so you can start delivering business value instantly, without errors. Go to dataengineeringpodcast.com/datakitchen today to learn more and thank them for supporting the show!\nYour host is Tobias Macey and today I’m welcoming back Chris Bergh to talk about ways that DataOps principles can help to reduce organizational complexity\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nHow are typical data and analytic teams organized? What are their roles and structure?\nCan you start by giving an outline of the ways that complexity can manifest in a data organization?\n\nWhat are some of the contributing factors that generate this complexity?\nHow does the size or scale of an organization and their data needs impact the segmentation of responsibilities and roles?\n\n\nHow does this organizational complexity play out within a single team? 
For example between data engineers, data scientists, and production/operations?\nHow do you approach the definition of useful interfaces between different roles or groups within an organization?\n\nWhat are your thoughts on the relationship between the multivariate complexities of data and analytics workflows and the software trend toward microservices as a means of addressing the challenges of organizational communication patterns in the software lifecycle?\n\n\nHow does this organizational complexity play out between multiple teams?\nFor example between centralized data team and line of business self service teams?\nIsn’t organizational complexity just ‘the way it is’? Is there any how in getting out of meetings and inter team conflict?\nWhat are some of the technical elements that are most impactful in reducing the time to delivery for different roles?\nWhat are some strategies that you have found to be useful for maintaining a connection to the business need throughout the different stages of the data lifecycle?\nWhat are some of the signs or symptoms of problematic complexity that individuals and organizations should keep an eye out for?\nWhat role can automated testing play in improving this process?\nHow do the current set of tools contribute to the fragmentation of data workflows?\n\nWhich set of technologies are most valuable in reducing complexity and fragmentation?\n\n\nWhat advice do you have for data engineers to help with addressing complexity in the data organization and the problems that it contributes to?\n\nContact Info\n\nLinkedIn\n@ChrisBergh on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDataKitchen\nDataOps\nNASA Ames Research Center\nExcel\nTableau\nLooker\n\nPodcast Episode\n\n\nAlteryx\nTrifacta\nPaxata\nAutoML\nInformatica\nSAS\nConway’s Law\nRandom Forest\nK-Means Clustering\nGraphQL\nMicroservices\nIntuit Superglue\nAmundsen\n\nPodcast Episode\n\n\nMaster Data Management\n\nPodcast Episode\n\n\nHadoop\nGreat Expectations\n\nPodcast Episode\n\n\nObservability\nContinuous Integration\nContinuous Delivery\nW. Edwards Deming\nThe Joel Test\nJoel Spolsky\nDataOps Blog\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how it needs to be accessed, and what to do with it, many data projects are doomed to failure. In this episode Chris Bergh explains how taking an agile approach to delivering value can drive down the complexity that grows out of the varied needs of the business. Building a DataOps workflow that incorporates fast delivery of well defined projects, continuous testing, and open lines of communication is a proven path to success.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"An interview about using a DataOps approach to reduce the technical and organizational complexity that occurs in data driven organizations.","date_published":"2020-04-27T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/be1d5ab1-7763-4db0-94df-97f3374cc2be.mp3","mime_type":"audio/mpeg","size_in_bytes":50807097,"duration_in_seconds":3708}]},{"id":"podlove-2020-04-20t01:37:15+00:00-9151eacb8d1ec81","title":"Building Real Time Applications On Streaming Data With Eventador","url":"https://www.dataengineeringpodcast.com/eventador-streaming-data-episode-129","content_text":"Summary\nModern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that you collect, without worrying about how to make it reliable. In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. This was an interesting inside look at building a business on top of open source stream processing frameworks and how to reduce the burden on end users.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. 
And don’t forget to thank them for their continued support of this show!\nYour host is Tobias Macey and today I’m interviewing Kenny Gorman about the Eventador streaming SQL platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what the Eventador platform is and the story\nbehind it?\n\nHow has your experience at ObjectRocket influenced your approach to streaming SQL?\nHow do the capabilities and developer experience of Eventador compare to other streaming SQL engines such as ksqlDB, Pulsar SQL, or Materialize?\n\n\nWhat are the main use cases that you are seeing people use for streaming SQL?\n\nHow does it fit into an application architecture?\nWhat are some of the design changes in the different layers that are necessary to take advantage of the real time capabilities?\n\n\nCan you describe how the Eventador platform is architected?\n\nHow has the system design evolved since you first began working on it?\nHow has the overall landscape of streaming systems changed since you first began working on Eventador?\nIf you were to start over today what would you do differently?\n\n\nWhat are some of the most interesting and challenging operational aspects of running your platform?\nWhat are some of the ways that you have modified or augmented the SQL dialect that you support?\n\nWhat is the tipping point for when SQL is insufficient for a given task and a user might want to leverage Flink?\n\n\nWhat is the workflow for developing and deploying different SQL jobs?\n\nHow do you handle versioning of the queries and integration with the software development lifecycle?\n\n\nWhat are some data modeling considerations that users should be aware of?\n\nWhat are some of the sharp edges or design pitfalls that users should be aware of?\n\n\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen your customers use your platform?\nWhat are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of building and scaling Eventador?\nWhat do you have planned for the future of the platform?\n\nContact Info\n\nLinkedIn\nBlog\n@kennygorman on Twitter\nkgorman on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nEventador\nOracle DB\nPaypal\nEBay\nSemaphore\nMongoDB\nObjectRocket\nRackSpace\nRethinkDB\nApache Kafka\nPulsar\nPostgreSQL Write-Ahead Log (WAL)\nksqlDB\n\nPodcast Episode\n\n\nPulsar SQL\nMaterialize\n\nPodcast Episode\n\n\nPipelineDB\n\nPodcast Episode\n\n\nApache Flink\n\nPodcast Episode\n\n\nTimely Dataflow\nFinTech == Financial Technology\nAnomaly Detection\nNetwork Security\nMaterialized View\nKubernetes\nConfluent Schema Registry\n\nPodcast Episode\n\n\nANSI SQL\nApache Calcite\nPostgreSQL\nUser Defined Functions\nChange Data Capture\n\nPodcast Episode\n\n\nAWS Kinesis\nUber AthenaX\nNetflix Keystone\nVerverica\nRockset\n\nPodcast Episode\n\n\nBackpressure\nKeen.io\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that you collect, without worrying about how to make it reliable. In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. This was an interesting inside look at building a business on top of open source stream processing frameworks and how to reduce the burden on end users.
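To make the idea of a streaming SQL query concrete, here is a minimal, self-contained Python sketch of what an engine does behind a statement like SELECT user_id, COUNT(*) FROM events GROUP BY user_id: it consumes an unbounded sequence of events and keeps an aggregate continuously up to date instead of recomputing a batch. The event source and field names are invented for illustration; a platform like Eventador would run equivalent logic inside a managed engine such as Flink rather than in application code.

```python
from collections import defaultdict

def event_stream():
    """Stand-in for an unbounded source; in practice this would be a Kafka topic."""
    yield {"user_id": "alice", "action": "login"}
    yield {"user_id": "bob", "action": "click"}
    yield {"user_id": "alice", "action": "purchase"}

# Conceptually: SELECT user_id, COUNT(*) FROM events GROUP BY user_id
counts = defaultdict(int)

for event in event_stream():
    counts[event["user_id"]] += 1  # incremental update, no batch recomputation
    print(dict(counts))            # each emission reflects the latest state of the "view"
```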

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview with Eventador CEO Kenny Gorman about the challenges of building a managed service for streaming data to simplify building real time applications","date_published":"2020-04-19T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/61489957-064b-4695-9b93-8009cf65a6d7.mp3","mime_type":"audio/mpeg","size_in_bytes":44145861,"duration_in_seconds":3030}]},{"id":"podlove-2020-04-13t14:02:37+00:00-7cd829deca2dec6","title":"Making Data Collection In Your Code Easy With Rookout","url":"https://www.dataengineeringpodcast.com/rookout-data-collection-episode-128","content_text":"Summary\nThe software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYour host is Tobias Macey and today I’m interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the types of data that we typically collect for the systems operations context?\n\nWhat are some of the business questions that can be answered from these data sources?\n\n\nWhat are some of the considerations that developers and operations engineers need to be aware of when they are defining the collection points for system metrics and log messages?\n\nWhat are some effective strategies that you have found for including business stake holders in the process of defining these collection points?\n\n\nOne of the difficulties in building useful analyses from any source of data is maintaining the appropriate context. 
What are some of the necessary metadata that should be maintained along with operational metrics?\n\nWhat are some of the shortcomings in the systems we design and use for operational data stores in terms of making the collected data useful for other purposes?\n\n\nHow does the existing tooling need to be changed or augmented to simplify the collaboration between engineers and stake holders for defining and collecting the needed information?\nThe types of systems that we use for collecting and analyzing operations metrics are often designed and optimized for different access patterns and data formats than those used for analytical and exploratory purposes. What are your thoughts on how to incorporate the collected metrics with behavioral data?\nWhat are some of the other sources of dark data that we should keep an eye out for in our organizations?\n\nContact Info\n\nLinkedIn\n@Liran_Last on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nRookout\nCybersecurity\nDevOps\nDataDog\nGraphite\nElasticsearch\nLogz.io\nKafka\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview with Rookout's CTO about the importance of including non-technical roles in the data collection process","date_published":"2020-04-13T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f996e1b6-731b-4c41-9074-e8a4b6a2f8fd.mp3","mime_type":"audio/mpeg","size_in_bytes":23277265,"duration_in_seconds":1560}]},{"id":"podlove-2020-04-06t23:42:06+00:00-d09e01d6406384d","title":"Building A Knowledge Graph Of Commercial Real Estate At Cherre","url":"https://www.dataengineeringpodcast.com/cherre-knowledge-graph-episode-127","content_text":"Summary\nKnowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Cherre is and the role that data plays in the business?\nWhat are the benefits of a knowledge graph for making real estate investment decisions?\nWhat are the main ways that you and your customers are using the knowledge graph?\n\nWhat are some of the challenges that you face in providing a usable interface for end-users to query the graph?\n\n\nWhat technology are you using for storing and processing the graph?\n\nWhat challenges do you face in scaling the complexity and analysis of the graph?\n\n\nWhat are the main sources of data for the knowledge graph?\nWhat are some of the ways that messiness manifests in the data that you are using to populate the graph?\n\nHow are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure?\nHow do you handle missing attributes or extra attributes in a given record?\n\n\nHow did you approach the process of determining an effective taxonomy for records in the graph?\nWhat is involved in performing entity extraction on your data?\nWhat are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph?\nWhat are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data?\nWhat are some of the near and medium term improvements that you have planned for your knowledge graph?\nWhat advice do you have for anyone who is interested in building a knowledge graph of their own?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nCherre\nCommercial Real Estate\nKnowledge Graph\nRDF Triple\nDGraph\n\nPodcast Interview\n\n\nNeo4J\nTigerGraph\nGoogle BigQuery\nApache Spark\n\nSpark In Action Episode\n\n\nEntity Extraction/Named Entity Recognition\nNetworkX\nSpark Graph Frames\nGraph Embeddings\nAirflow\n\nPodcast.__init__ Interview\n\n\nDBT\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.
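As a rough illustration of the concept (not Cherre's production pipeline), the NetworkX library mentioned in the links below can represent entities and their relationships as a graph and answer connection questions that are awkward to express over flat tables. The entities and relationship labels here are invented for the example.

```python
import networkx as nx

# Toy knowledge graph: nodes are entities, edges carry the relationship type.
g = nx.Graph()
g.add_edge("123 Main St", "Acme Holdings LLC", relation="owned_by")
g.add_edge("456 Oak Ave", "Acme Holdings LLC", relation="owned_by")
g.add_edge("Acme Holdings LLC", "Jane Doe", relation="managed_by")

# "What connects this address to this person?" is a path query over relationships.
print(nx.shortest_path(g, "123 Main St", "Jane Doe"))
# ['123 Main St', 'Acme Holdings LLC', 'Jane Doe']

# "Which other properties share an owner with 123 Main St?"
owner = next(n for n in g.neighbors("123 Main St")
             if g.edges["123 Main St", n]["relation"] == "owned_by")
print([n for n in g.neighbors(owner)
       if g.edges[owner, n]["relation"] == "owned_by" and n != "123 Main St"])
```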

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview about how Cherre builds and maintains a knowledge graph of commercial real estate data and how it enables them to answer valuable questions","date_published":"2020-04-06T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ea1f3374-a3f0-4f49-b3d4-390287b30fa4.mp3","mime_type":"audio/mpeg","size_in_bytes":39311999,"duration_in_seconds":2720}]},{"id":"podlove-2020-03-30t22:54:55+00:00-941696dcf46d67d","title":"The Life Of A Non-Profit Data Professional","url":"https://www.dataengineeringpodcast.com/non-profit-data-engineering-episode-126","content_text":"Summary\nBuilding and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance challenges, he describes the biggest challenges that he is facing.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference and ODSC East which has also gone virtual. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Tyler Colby about his experiences working as a data professional in the non-profit arena, most recently at the Natural Resources Defense Council\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing your responsibilities as the director of data infrastructure at the NRDC?\nWhat specific challenges are you facing at the NRDC?\nCan you describe some of the types of data that you are working with at the NRDC?\n\nWhat types of systems are you relying on for the source of your data?\n\n\nWhat kinds of systems have you put in place to manage the data needs of the NRDC?\n\nWhat are your biggest influences in the build vs. 
buy decisions that you make?\nWhat heuristics or guidelines do you rely on for aligning your work with the business value that it will produce and the broader mission of the organization?\n\n\nHave you found there to be any extra scrutiny of your work as a member of a non-profit in terms of regulations or compliance questions?\nYour career has involved a significant focus on the Salesforce platform. For anyone not familiar with it, what benefits does it provide in managing information flows and analysis capabilities?\n\nWhat are some of the most challenging or complex aspects of working with Salesforce?\n\n\nIn light of the current global crisis posed by COVID-19 you have established a new non-profit entity to organize the efforts of various technical professionals. Can you describe the nature of that mission?\n\nWhat are some of the unique data challenges that you anticipate or have already encountered?\nHow do the data challenges of this new organization compare to your past experiences?\n\n\nWhat have you found to be most useful or beneficial in the current landscape of data management systems and practices in your career with non-profit organizations?\n\nWhat are the areas that need to be addressed or improved for workers in the non-profit sector?\n\n\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nNRDC\nAWS Redshift\nTime Warner Cable\nSalesforce\nCloud For Good\nTableau\nCivis Analytics\nEveryAction\nBlackBaud\nActionKit\nMobileCommons\nXKCD 1667\nGDPR == General Data Protection Regulation\nCCPA == California Consumer Privacy Act\nSalesforce Apex\nSalesforce.org\nSalesforce Non-Profit Success Pack\nValidity\nOpenRefine\nJitterBit\nSkyvia\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance challenges, he describes the biggest challenges that he is facing.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Closing Announcements

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview with Tyler Colby about his experiences working as a data professional in the non-profit sector and the challenges that are unique to that domain","date_published":"2020-03-30T19:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5425bfdd-9a08-4e0d-8754-e9d61960ebab.mp3","mime_type":"audio/mpeg","size_in_bytes":31190447,"duration_in_seconds":2676}]},{"id":"podlove-2020-03-23t01:00:14+00:00-f2780867f81c2d1","title":"Behind The Scenes Of The Linode Object Storage Service","url":"https://www.dataengineeringpodcast.com/linode-object-storage-service-episode-125","content_text":"Summary\nThere are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3 compatible object storage to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the current state of your object storage product?\n\nWhat was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?\n\n\nWhat is the scale and scope of usage that you had to design for?\nCan you describe how your platform is implemented?\n\nWhat was your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch?\nHow have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?\n\n\nWhat have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode?\nWhat are the most important capabilities for the underlying hardware that you are running on?\nWhat supporting systems and tools are you using to manage the availability and durability of your object storage?\nHow did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage?\nWhat are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams?\nWhat are your thoughts on the state of the S3 API as a de facto standard for object storage?\nWhat is your main focus now that object storage is being rolled out to more data centers?\n\nContact Info\n\nDorthu on GitHub\ndorthu22 on Twitter\nLinkedIn\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nLinode Object Storage\nXen Hypervisor\nKVM (Linux Kernel Virtual Machine)\nLinode API V4\nCeph Distributed Filesystem\n\nPodcast Episode\n\n\nWasabi\nBackblaze\nMinIO\nCERN Ceph Scaling Paper\nRADOS Gateway\nOpenResty\nLua\nPrometheus\nLinode Managed Kubernetes\nCeph Swift Protocol\nCeph Bug Tracker\nLinode Dashboard Application Source Code\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3 compatible object storage to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
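Because the service exposes the S3 API, existing S3 tooling should work by pointing it at a Linode endpoint. Below is a minimal sketch using boto3; the endpoint URL, bucket name, and the environment variables holding the access keys are assumptions for the example, so confirm the exact values against Linode's documentation.

```python
import os
import boto3

# Any S3-compatible client works; only the endpoint differs from AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="https://us-east-1.linodeobjects.com",  # illustrative regional endpoint
    aws_access_key_id=os.environ["LINODE_OBJ_ACCESS_KEY"],
    aws_secret_access_key=os.environ["LINODE_OBJ_SECRET_KEY"],
)

s3.create_bucket(Bucket="example-bucket")
s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello from the S3 API")
print(s3.get_object(Bucket="example-bucket", Key="hello.txt")["Body"].read())
```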

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview with the project lead for Linode's recently released object storage service about the challenges involved in building a provider grade S3 compatible service.","date_published":"2020-03-23T09:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/63877a07-e96b-47f8-aa17-8d49056ccd6a.mp3","mime_type":"audio/mpeg","size_in_bytes":24490785,"duration_in_seconds":2153}]},{"id":"podlove-2020-03-16t23:20:14+00:00-2c898467c61ba55","title":"Building A New Foundation For CouchDB","url":"https://www.dataengineeringpodcast.com/couchdb-document-database-episode-124","content_text":"Summary\nCouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and a HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAre you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!\nSetting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. 
And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what CouchDB is?\n\nHow did you get involved in the CouchDB project and what is your current role in the community?\n\n\nWhat are the use cases that it is well suited for?\nCan you share some of the history of CouchDB and its role in the NoSQL movement?\nHow is CouchDB currently architected and how has it evolved since it was first introduced?\nWhat have been the benefits and challenges of Erlang as the runtime for CouchDB?\nHow is the current storage engine implemented and what are its shortcomings?\nWhat problems are you trying to solve by replatforming on a new storage layer?\n\nWhat were the selection criteria for the new storage engine and how did you structure the decision making process?\nWhat was the motivation for choosing FoundationDB as opposed to other options such as RocksDB, LevelDB, etc.?\n\n\nHow is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?\nHow will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?\nWhat will the migration path be for people running an existing installation?\nWhat are some of the biggest challenges that you are facing in rearchitecting the codebase?\nWhat new capabilities will the FoundationDB storage layer enable?\nWhat are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?\n\nWhat new capabilities or use cases do you anticipate once this migration is complete?\n\n\nWhat are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?\nWhat is in store for the future of CouchDB?\n\nContact Info\n\nLinkedIn\n@kocolosk on Twitter\nkocolosk on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nApache CouchDB\nFoundationDB\n\nPodcast Episode\n\n\nIBM\nCloudant\nExperimental Particle Physics\nFPGA == Field Programmable Gate Array\nApache Software Foundation\nCRDT == Conflict-free Replicated Data Type\n\nPodcast Episode\n\n\nErlang\nRiak\nRabbitMQ\nHeisenbug\nKubernetes\nProperty Based Testing\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC 
BY-SA\n\n\n","content_html":"

Summary

\n

CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and a HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.
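That HTTP interface means a plain HTTP client is enough to create a database and store documents. Here is a minimal sketch using Python's requests library against a single local node; the host, credentials, and database name are assumptions for the example.

```python
import requests

BASE = "http://localhost:5984"  # assumed local CouchDB node
AUTH = ("admin", "password")    # assumed admin credentials

# Create a database, then write and read back a JSON document, all over plain HTTP.
requests.put(f"{BASE}/podcasts", auth=AUTH)
created = requests.post(f"{BASE}/podcasts", auth=AUTH,
                        json={"title": "Building A New Foundation For CouchDB", "episode": 124})
doc_id = created.json()["id"]
print(requests.get(f"{BASE}/podcasts/{doc_id}", auth=AUTH).json())
```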

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview about the CouchDB document database and the work being done to rearchitect it to run on top of FoundationDB","date_published":"2020-03-16T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5ac19716-70ac-4912-8bab-4dccf8078290.mp3","mime_type":"audio/mpeg","size_in_bytes":37482023,"duration_in_seconds":3325}]},{"id":"podlove-2020-03-09t01:12:56+00:00-4e39f036bcc1c8a","title":"Scaling Data Governance For Global Businesses With A Data Hub Architecture","url":"https://www.dataengineeringpodcast.com/data-hub-architecture-data-governance-episode-123","content_text":"Summary\nData governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control with inheritance of higher level controls it reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the goals of a data hub architecture?\nWhat are the elements of a data hub architecture and how do they contribute to the overall goals?\n\nWhat are some of the patterns or reference architectures that you drew on to develop this approach?\n\n\nWhat are some signs that an organization should implement a data hub architecture?\nWhat is the migration path for an organization that has an existing data platform but needs to scale their governance and localize storage and access?\nWhat are the features or attributes of an individual hub that allow for them to be interconnected?\n\nWhat is the interface presented between hubs to allow for accessing information across these localized repositories?\n\n\nWhat is the process for adding a new hub and making it discoverable across the organization?\nHow is discoverability of data managed within and between hubs?\nIf someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?\n\nIf data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?\nHow are access controls and data masking managed to ensure that various compliance regimes are honored?\nIn addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?\n\n\nGiven that different hubs can have differing rules for quality, cleanliness, or structure of a given record, how do you handle transformations of data as it traverses different hubs?\n\nHow do you address issues of data loss or corruption within those transformations?\n\n\nHow is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?\nHow do you manage tracking and reporting of data lineage within and across hubs?\nFor an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?\n\nWhat are some of the considerations and useful technologies that would assist in creating and connecting hubs?\n\nShould the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?\n\n\n\n\nWhen is a data hub architecture the wrong approach?\n\nContact Info\n\nLinkedIn\n@jerrong on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCluedIn\n\nPodcast Episode\nEventual Connectivity Episode\n\n\nFuturama\nKubernetes\nZookeeper\n\nPodcast Episode\n\n\nData Governance\nData Lineage\nData Sovereignty\nGraph Database\nHelm Chart\nApplication Container\nDocker Compose\nLinkedIn DataHub\nUdemy\nPluralSight\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control with inheritance of higher level controls it reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.
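The inheritance idea can be sketched in a few lines of code: each hub stores only its local policy overrides and defers to its parent for anything it does not define, so global rules are written once and regional hubs specialize them. This is a conceptual sketch of the pattern, not how CluedIn or any specific product implements it; the hub names and policy keys are invented.

```python
class Hub:
    """Hub with local policy overrides that fall back to a parent hub."""
    def __init__(self, name, parent=None, policies=None):
        self.name = name
        self.parent = parent
        self.policies = policies or {}  # only the local overrides live here

    def policy(self, key):
        if key in self.policies:
            return self.policies[key]
        if self.parent is not None:
            return self.parent.policy(key)  # inherit from the higher-level hub
        raise KeyError(f"no policy {key!r} defined anywhere in the hierarchy")

global_hub = Hub("global", policies={"retention_days": 365, "pii_masking": True})
eu_hub = Hub("eu", parent=global_hub, policies={"retention_days": 30})  # local override

print(eu_hub.policy("retention_days"))  # 30, defined locally
print(eu_hub.policy("pii_masking"))     # True, inherited from the global hub
```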

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview about how a data hub architecture can reduce the overhead of managing data governance and compliance across an organization","date_published":"2020-03-09T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/70a7a51a-6948-4a1c-a2d0-c254f5dc810a.mp3","mime_type":"audio/mpeg","size_in_bytes":38144802,"duration_in_seconds":3248}]},{"id":"podlove-2020-03-02t13:55:33+00:00-e34bfcfa1206fb8","title":"Easier Stream Processing On Kafka With ksqlDB","url":"https://www.dataengineeringpodcast.com/ksqldb-kafka-stream-processing-episode-122","content_text":"Summary\nBuilding applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAre you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. 
Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what ksqlDB is?\nWhat are some of the use cases that it is designed for?\nHow do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?\nWhat was the motivation for building a unified project for providing a database interface on the data stored in Kafka?\nHow is ksqlDB architected?\n\nIf you were to rebuild the entire platform and its components from scratch today, what would you do differently?\n\n\nWhat is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?\n\nWhat dialect of SQL is supported?\n\nWhat kinds of extensions or built-in functions have been added to aid in the creation of streaming queries?\n\n\n\n\nHow are table schemas defined and enforced?\n\nHow do you handle schema migrations on active streams?\n\n\nTypically a database is considered a long term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?\nCan you talk through an example architecture that might incorporate ksqlDB including the source systems, applications that might interact with the data in transit, and any destination systems for long term persistence?\nWhat are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?\nWhat are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?\nWhat is involved in deploying and maintaining an installation of ksqlDB?\n\nWhat are some of the operational characteristics of the system that should be considered while planning an installation such as scaling factors, high availability, or potential bottlenecks in the architecture?\n\n\nWhen is ksqlDB the wrong choice?\nWhat are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?\nWhat are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?\nWhat is in store for the future of the project?\n\nContact Info\n\n@michaeldrogalis on Twitter\nmichaeldrogalis on GitHub\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nksqlDB\nConfluent\nErlang\nOnyx\nApache Storm\nStream Processing\nKafka\nksql\nKafka Streams\nPulsar\n\nPodcast Episode\n\n\nPulsar SQL\nPipelineDB\n\nPodcast Episode\n\n\nMaterialize\n\nPodcast Episode\n\n\nKafka Connect\nRocksDB\nJava Jar\nCLI == Command Line Interface\nPrestoDB\n\nPodcast Episode\n\n\nANSI SQL\nPravega\n\nPodcast Episode\n\n\nEventual Consistency\nConfluent Cloud\nMySQL\nPostgreSQL\nGraphQL\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.
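ksqlDB exposes its SQL layer over HTTP, so a persistent stream processing query can be created from a short script. The sketch below assumes a local ksqlDB server on port 8088 and an existing Kafka topic named pageviews; the stream name, schema, and statement syntax are illustrative and should be checked against the ksqlDB documentation for your version.

```python
import requests

KSQLDB = "http://localhost:8088/ksql"  # assumed local ksqlDB server

statements = """
    CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
        WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
    CREATE TABLE views_per_user AS
        SELECT user_id, COUNT(*) AS views
        FROM pageviews
        GROUP BY user_id
        EMIT CHANGES;
"""

resp = requests.post(
    KSQLDB,
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": statements, "streamsProperties": {}},
)
print(resp.status_code, resp.json())
```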

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"An interview about the ksqlDB platform and the unified experience that it provides for building stream processing applications on top of Kafka with SQL.","date_published":"2020-03-02T13:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b3887346-8930-4605-aaf8-49a682bbbf8b.mp3","mime_type":"audio/mpeg","size_in_bytes":31830725,"duration_in_seconds":2616}]},{"id":"podlove-2020-02-25t03:55:33+00:00-d472b56b727fa0e","title":"Shining A Light on Shadow IT In Data And Analytics","url":"https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121","content_text":"Summary\nMisaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAre you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Sean Knapp, Charlie Crocker about shadow IT in data and analytics\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your definition of shadow IT?\nWhat are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?\n\nWhat are some of the roles in an organization that you have seen involved in these shadow IT projects?\n\n\nWhat kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?\n\nWhat are some of the pitfalls that these solutions present as a result of their initial ease of use?\n\n\nWhat are the benefits to the organization of individuals or teams building and managing their own solutions?\nWhat are some of the risks associated with these implementations of data collection, storage, management, or analysis that have no oversight from the teams typically tasked with managing those systems?\n\nWhat are some of the ways that compliance or data quality issues can arise from these projects?\n\n\nOnce a project has been started outside of the approved channels it can quickly take on a life of its own. What are some of the ways you have identified the presence of \"unauthorized\" data projects?\n\nOnce you have identified the existence of such a project how can you revise their implementation to integrate them with the \"approved\" platform that the organization supports?\n\n\nWhat are some strategies for removing the friction in the collection, access, or availability of data in an organization that can eliminate the need for shadow IT implementations?\nWhat are some of the inherent complexities in data management which you would like to see resolved in order to reduce the tensions that lead to these bespoke solutions?\n\nContact Info\n\nSean\n\nLinkedIn\n@seanknapp on Twitter\n\n\nCharlie\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nShadow IT\nAscend\n\nPodcast Episode\n\n\nZoneHaven\nGoogle Sawzall\nM&A == Mergers and Acquisitions\nDevOps\nWaterfall Development\nData Governance\nData Lineage\nPioneers, Settlers, and Town Planners\nPowerBI\nTableau\nExcel\nAmundsen\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

\n

Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.

\n

Announcements

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n
\n\n

","summary":"A conversation about the conflicts that lead to shadow IT in data and analytics projects and how to work toward resolving those tensions.","date_published":"2020-02-24T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/70a8c7e6-e3f2-485e-8e69-c77c54507be6.mp3","mime_type":"audio/mpeg","size_in_bytes":36949267,"duration_in_seconds":2768}]},{"id":"podlove-2020-02-18t02:49:14+00:00-63a181af968d42b","title":"Data Infrastructure Automation For Private SaaS At Snowplow","url":"https://www.dataengineeringpodcast.com/snowplow-data-infrastructure-automation-episode-120","content_text":"Summary\nOne of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customer’s cloud accounts.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of the components in your system architecture and the nature of your managed service?\nWhat are some of the challenges that are inherent to private SaaS nature of your managed service?\nWhat elements of your system require the most attention and maintenance to keep them running properly?\nWhich components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?\nHow do you manage deployment of the full Snowplow pipeline for your customers?\n\nHow has your strategy for deployment evolved since you first began Soffering the managed service?\nHow has the architecture of the pipeline evolved to simplify operations?\n\n\nHow much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?\n\nWhat are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?\n\nHow does that reflect in the tooling that you use to manage their deployments?\n\n\n\n\nWhat types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers pipelines are running smoothly?\nWhat are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?\nWhat are some lessons that you can generalize for management of data infrastructure more broadly?\nIf you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?\nWhat do you have planned for the future of the Snowplow product and infrastructure management?\n\nContact Info\n\nLinkedIn\njbeemster on GitHub\n@jbeemster1 on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSnowplow Analytics\n\nPodcast Episode\n\n\nTerraform\nConsul\nNomad\nMeltdown Vulnerability\nSpectre Vulnerability\nAWS Kinesis\nElasticsearch\nSnowflakeDB\nIndicative\nS3\nSegment\nAWS Cloudwatch\nStackdriver\nApache Kafka\nApache Pulsar\nGoogle Cloud PubSub\nAWS SQS\nAWS SNS\nAWS Redshift\nAnsible\nAWS Cloudformation\nKubernetes\nAWS EMR\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure, then it’s worth listening to how Josh and his team are approaching the problem.
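
To make the idea of running many private deployments of the same pipeline concrete, here is a minimal sketch of one common pattern: keeping a separate Terraform workspace per customer environment and applying the same modules with per-customer variables. This is illustrative only, not Snowplow's actual tooling; the customer names, variable file layout, and use of Terraform workspaces are assumptions.

```python
# Illustrative sketch: drive one Terraform configuration per customer
# environment by switching workspaces, so every private deployment of the
# same pipeline stays in sync. Customer names and tfvars paths are hypothetical.
import subprocess

CUSTOMERS = ["acme", "globex", "initech"]

def terraform(*args: str) -> None:
    # Fail loudly if any Terraform command exits non-zero.
    subprocess.run(["terraform", *args], check=True)

for customer in CUSTOMERS:
    # Each customer gets an isolated state file via a named workspace.
    # "workspace new" fails if it already exists, so don't treat that as fatal.
    subprocess.run(["terraform", "workspace", "new", customer], check=False)
    terraform("workspace", "select", customer)
    # Shared modules, per-customer variables.
    terraform("plan", f"-var-file=customers/{customer}.tfvars", "-out=plan.out")
    terraform("apply", "plan.out")
```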

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Snowplow Analytics tech lead about how they manage data infrastructure for streaming events across multiple clouds","date_published":"2020-02-17T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/03208892-9c49-4809-8e09-f05e25f4033c.mp3","mime_type":"audio/mpeg","size_in_bytes":31763920,"duration_in_seconds":2941}]},{"id":"podlove-2020-02-09t19:31:39+00:00-a3604de40898967","title":"Data Modeling That Evolves With Your Business Using Data Vault","url":"https://www.dataengineeringpodcast.com/data-vault-data-modeling-episode-119","content_text":"Summary\nDesigning the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nSetting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve Clickhouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for Clickhouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?\n\nWhat is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?\nHow did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?\n\n\nWhat are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project Datafailure?\nWhat are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?\n\nHow has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?\nIs there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?\n\n\nWhat are the steps for establishing and evolving a data vault model in an organization?\n\nHow does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?\n\n\nWhat are some of the changes in query structure that consumers of the model will need to plan for?\nAre there any performance or complexity impacts imposed by the data vault approach?\nCan you talk through the overall lifecycle of data in a data vault modeled warehouse?\n\nHow does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?\n\n\nWhat are some cases where a data vault approach doesn’t fit the needs of an organization or application?\nFor listeners who want to learn more, what are some references or exercises that you recommend?\n\nContact Info\n\nWebsite\nLinkedIn\n@KentGraziano on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nSnowflakeDB\nData Vault Modeling\nData Warrior Blog\nOLTP == On-Line Transaction Processing\nData Warehouse\nBill Inmon\nClaudia Imhoff\nOracle DB\nThird Normal Form\nStar Schema\nSnowflake Schema\nRelational Theory\nSixth Normal Form\nDenormalization\nPivot Table\nDan Linstedt\nTDAN.com\nRalph Kimball\nAgile Manifesto\nSchema On Read\nData Lake\nHadoop\nNoSQL\nData Vault Conference\nTeradata\nODS (Operational Data Store) Model\nSupercharge Your Data Warehouse (affiliate link)\nBuilding A Scalable Data Warehouse With Data Vault 2.0 (affiliate link)\nData Model Resource Book (affiliate link)\nData Warehouse Toolkit (affiliate link)\nBuilding The Data Warehouse (affiliate link)\nDan Linstedt Blog\nPerforrmance G2\nScale Free European Classes\nCertus Australian Classes\nWherescape\nErwin\nVaultSpeed\nData Vault Builder\nVarigence BimlFlex\n\nThe intro and outro music is from The Hug by 
The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations or massive up-front design to answer valuable questions. In this episode Kent Graziano shares his journey with Data Vault, explains how it allows for an agile approach to data warehousing, and walks through the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow-moving projects, or challenges integrating new data sources, then listen in on this conversation and give Data Vault a try for yourself.
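
As a rough illustration of the mechanics that make that in-place evolution possible, the sketch below shows how Data Vault 2.0 style hub and satellite rows can be built around deterministic hash keys, load dates, and record sources so that new data is always appended rather than rewritten. The table layout, column names, and use of MD5 here are illustrative assumptions, not a prescription from the episode.

```python
# Illustrative sketch of Data Vault 2.0 loading mechanics: hubs hold only
# business keys plus load metadata, satellites hold descriptive attributes
# as insert-only history. All names here are hypothetical.
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Deterministic surrogate key derived from one or more business keys."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def hub_customer_row(customer_number: str, record_source: str) -> dict:
    # The hub never changes shape when new attributes or sources appear.
    return {
        "hub_customer_hk": hash_key(customer_number),
        "customer_number": customer_number,
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
    }

def sat_customer_row(customer_number: str, attributes: dict, record_source: str) -> dict:
    # Satellites are insert-only: one new row per change, keyed by the hub
    # hash key plus load date, with a hash diff used to detect real changes.
    return {
        "hub_customer_hk": hash_key(customer_number),
        "load_date": datetime.now(timezone.utc),
        "hash_diff": hash_key(*[f"{k}={v}" for k, v in sorted(attributes.items())]),
        "record_source": record_source,
        **attributes,
    }

print(hub_customer_row("C-1001", "crm"))
print(sat_customer_row("C-1001", {"name": "Acme Corp", "tier": "gold"}, "crm"))
```

Because every load is an append keyed by the same deterministic hashes, a new source system or a new satellite of attributes can be added without restructuring the tables that already exist.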

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the data vault method of data modeling and how it simplifies integrating the evolving data sources that you are dealing with in your enterprise data warehouse","date_published":"2020-02-09T14:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fe75bbf0-3a37-4d6c-97f2-579f0d0005f4.mp3","mime_type":"audio/mpeg","size_in_bytes":44856434,"duration_in_seconds":3981}]},{"id":"podlove-2020-02-03t19:23:46+00:00-90d121f033f846c","title":"The Benefits And Challenges Of Building A Data Trust","url":"https://www.dataengineeringpodcast.com/brighthive-data-trust-episode-118","content_text":"Summary\nEvery business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. Data trusts are a legal framework for allowing businesses to collaboratively pool their data. This allows the members of the trust to increase the value of their individual repositories and gain new insights which would otherwise require substantial effort in duplicating the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Tom Plagge and Gregory Mundy about BrightHive, a platform for building data trusts\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what a data trust is?\n\nWhy might an organization want to build one?\n\n\nWhat is BrightHive and what is its origin story?\nBeyond having a storage location with access controls, what are the components of a data trust that are necessary for them to be viable?\nWhat are some of the challenges that are common in establishing an agreement among organizations who are participating in a data trust?\n\nWhat are the responsibilities of each of the participants in a data trust?\nFor an individual or organization who wants to participate in an existing trust, what is involved in gaining access?\n\n\nHow does BrightHive support the process of building a data trust?\nHow is ownership of derivative data sets/data products and associated intellectual property handled in the context of a trust?\nHow is the technical architecture of BrightHive implemented and how has it evolved since it first started?\nWhat are some of the ways that you approach the challenge of data privacy in these sharing agreements?\nWhat are some legal and technical guards that you implement to encourage ethical uses of the data contained in a trust?\nWhat is the motivation for releasing the technical elements of BrightHive as open source?\nWhat are some of the most interesting, innovative, or inspirational ways that you have seen BrightHive used?\nBeing a shared platform for empowering other organizations to collaborate I imagine there is a strong focus on long-term sustainability. How are you approaching that problem and what is the business model for BrightHive?\nWhat have you found to be the most interesting/unexpected/challenging aspects of building and growing the technical and business infrastructure of BrightHive?\nWhat do you have planned for the future of BrightHive?\n\nContact Info\n\nTom\n\nLinkedIn\ntplagge on GitHub\n\n\nGregory\n\nLinkedIn\ngregmundy on GitHub\n@graygoree on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nBrightHive\nData Science For Social Good\nWorkforce Data Initiative\nNASA\nNOAA\nData Trust\nData Collaborative\nPublic Benefit Corporation\nTerraform\nAirflow\n\nPodcast.__init__ Episode\n\n\nDagster\n\nPodcast Episode\n\n\nSecure Multi-Party Computation\nPublic Key Encryption\nAWS Macie\nBlockchain\nSmart Contracts\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. A data trust is a legal framework that allows businesses to collaboratively pool their data. This allows the members of the trust to increase the value of their individual repositories and gain insights that would otherwise require duplicating, at substantial effort, the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the BrightHive platform for building data trusts and the complexities that are inherent in sharing data across organizations","date_published":"2020-02-03T15:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9a7ed645-f9bf-42c8-986b-3022d538e976.mp3","mime_type":"audio/mpeg","size_in_bytes":39360087,"duration_in_seconds":3412}]},{"id":"podlove-2020-01-26t22:01:27+00:00-0319bed1e4d1b79","title":"Pay Down Technical Debt In Your Data Pipeline With Great Expectations","url":"https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117","content_text":"Summary\nData pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also complex and difficult to test, leading to a significant amount of technical debt which contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Great Expecations is and the origin of the project?\n\nWhat has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?\n\n\nPrior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?\nWhat are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?\n\nWhat are some of the non-obvious use cases for Great Expectations?\n\n\nWhat aspects of a data pipeline or the context that it operates in are unable to be tested or validated in a programmatic fashion?\nCan you describe how Great Expectations is implemented?\nFor anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?\nWhat are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?\nCan you talk through some of the ways that Great Expectations can be extended?\nWhat are some notable extensions or integrations of Great Expectations?\nBeyond the testing and validation of data as it is being processed you have also included features that support documentation and collaboration of the data lifecycles. What are some of the ways that those features can benefit a team working with Great Expectations?\nWhat are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?\nWhat are the limitations of Great Expectations?\nWhat are some cases where Great Expectations would be the wrong choice?\nWhat do you have planned for the future of Great Expectations?\n\nContact Info\n\nLinkedIn\n@jpcampbell42 on Twitter\njcampbell on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nGreat Expectations\n\nGitHub\nTwitter\n\n\nPodcast.__init__ Interview on Great Expectations\nSuperconductive Health\nAbe Gong\nPandas\n\nPodcast.__init__ Interview\n\n\nSQLAlchemy\nPostgreSQL\n\nPodcast Episode\n\n\nRedShift\nBigQuery\nSpark\nCloudera\nDataBricks\nGreat Expectations Data Docs\nGreat Expectations Data Profiling\nApache NiFi\nAmazon Deequ\nTensorflow Data Validation\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data pipelines are complicated, business-critical pieces of technical infrastructure. Unfortunately they are also difficult to test, which leads to a significant amount of technical debt and contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.
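
As a flavor of what those pipeline tests look like, here is a minimal sketch using the classic pandas-backed Great Expectations interface. The file name and column names are hypothetical, and the library's API has evolved across releases, so treat the exact calls as illustrative rather than definitive.

```python
# Minimal sketch of declaring expectations against a batch of data with the
# classic pandas-backed Great Expectations API. "orders.csv" and the column
# names are hypothetical.
import great_expectations as ge

# Wrap a CSV in a dataset object that exposes expect_* assertion methods.
orders = ge.read_csv("orders.csv")

orders.expect_column_to_exist("order_id")
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("order_total", min_value=0)
orders.expect_column_values_to_be_in_set("status", ["pending", "shipped", "cancelled"])

# Collect the results of every expectation declared above so a pipeline step
# can fail fast when the incoming data drifts.
results = orders.validate()
print(results)
```

Running the validation as a step in your pipeline lets you catch drifting or broken data at ingestion time instead of discovering it later in downstream reports.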

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Great Expectations framework helps you add meaningful tests and validation to your data pipeline to drive down technical debt","date_published":"2020-01-26T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3efa2cc0-b560-42ab-8149-454736b404e6.mp3","mime_type":"audio/mpeg","size_in_bytes":39722187,"duration_in_seconds":2791}]},{"id":"podlove-2020-01-20t01:13:36+00:00-e4ff06d2c5ebd1b","title":"Replatforming Production Dataflows","url":"https://www.dataengineeringpodcast.com/mayvenn-ascend-data-replatforming-episode-116","content_text":"Summary\nBuilding a reliable data platform is a neverending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start off by describing what Mayvenn is and give a sense of how you are using data?\nWhat are the sources of data that you are working with?\nWhat are the biggest challenges you are facing in collecting, processing, and analyzing your data?\nBefore adopting Ascend, what did your overall platform for data management look like?\nWhat were the pain points that you were facing which led you to seek a new solution?\n\nWhat were the selection criteria that you set forth for addressing your needs at the time?\nWhat were the aspects of Ascend which were most appealing?\n\n\nWhat are some of the edge cases that you have dealt with in the Ascend platform?\nNow that you have been using Ascend for a while, what components of your previous architecture have you been able to retire?\nCan you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent?\nHow has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics?\nWhat are your future plans for how to use data across your organization?\n\nContact Info\n\nSheel\n\nLinkedIn\nsheelc on GitHub\n\n\nSean\n\nLinkedIn\n@seanknapp on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMayvenn\nAscend\n\nPodcast Episode\n\n\nGoogle Sawzall\nClickstream\nApache Kafka\nAlooma\n\nPodcast Episode\n\n\nAmazon Redshift\nELT == Extract, Load, Transform\nDBT\n\nPodcast Episode\n\n\nAmazon Data Pipeline\nUpsolver\nPentaho\nStitch Data\nFivetran\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business, there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision-making required to keep your business data up to date and accurate.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how Mayvenn replatformed their production dataflows using Ascend and improved their ability to deliver meaningful analytics to their business","date_published":"2020-01-20T10:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a7fa7cf1-8de5-4d25-a1ec-4a6aff8359be.mp3","mime_type":"audio/mpeg","size_in_bytes":30058004,"duration_in_seconds":2340}]},{"id":"podlove-2020-01-13t21:11:56+00:00-211ad9fa8ee6e50","title":"Planet Scale SQL For The New Generation Of Applications With YugabyteDB","url":"https://www.dataengineeringpodcast.com/yugabytedb-planet-scale-sql-episode-115","content_text":"Summary\nThe modern era of software development is identified by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a new class of data storage which can accomodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Karthik Ranganathan about YugabyteDB, the open source, high-performance distributed SQL database for global, internet-scale apps.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what YugabyteDB is and its origin story?\nA growing trend in database engines (e.g. FaunaDB, CockroachDB) has been an out of the box focus on global distribution. 
Why is that important and how does it work in Yugabyte?\n\nWhat are the caveats?\n\n\nWhat are the most notable features of YugabyteDB that would lead someone to choose it over any of the myriad other options?\n\nWhat are the use cases that it is uniquely suited to?\n\n\nWhat are some of the systems or architecture patterns that can be replaced with Yugabyte?\nHow does the design of Yugabyte or the different ways it is being used influence the way that users should think about modeling their data?\nYugabyte is an impressive piece of engineering. Can you talk through the major design elements and how it is implemented?\nEasy scaling and failover is a feature that many database engines would like to be able to claim. What are the difficult elements that prevent them from implementing that capability as a standard practice?\n\nWhat do you have to sacrifice in order to support the level of scale and fault tolerance that you provide?\n\n\nSpeaking of scaling, there are many ways to define that term, from vertical scaling of storage or compute, to horizontal scaling of compute, to scaling of reads and writes. What are the primary scaling factors that you focus on in Yugabyte?\nHow do you approach testing and validation of the code given the complexity of the system that you are building?\nIn terms of the query API you have support for a Postgres compatible SQL dialect as well as a Cassandra based syntax. What are the benefits of targeting compatibility with those platforms?\n\nWhat are the challenges and benefits of maintaining compatibility with those other platforms?\n\n\nCan you describe how the storage layer is implemented and the division between the different query formats?\nWhat are the operational characteristics of YugabyteDB?\n\nWhat are the complexities or edge cases that users should be aware of when planning a deployment?\n\n\nOne of the challenges of working with large volumes of data is creating and maintaining backups. How does Yugabyte handle that problem?\nMost open source infrastructure projects that are backed by a business withhold various \"enterprise\" features such as backups and change data capture as a means of driving revenue. Can you talk through your motivation for releasing those capabilities as open source?\nWhat is the business model that you are using for YugabyteDB and how does it differ from the tribal knowledge of how open source companies generally work?\nWhat are some of the most interesting, innovative, or unexpected ways that you have seen yugabyte used?\nWhen is Yugabyte the wrong choice?\nWhat do you have planned for the future of the technical and business aspects of Yugabyte?\n\nContact Info\n\n@karthikr on Twitter\nLinkedIn\nrkarthik007 on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nYugabyteDB\n\nGitHub\n\n\nNutanix\nFacebook Engineering\nApache Cassandra\nApache HBase\nDelphi\nFuanaDB\n\nPodcast Episode\n\n\nCockroachDB\n\nPodcast Episode\n\n\nHA == High Availability\nOracle\nMicrosoft SQL Server\nPostgreSQL\n\nPodcast Episode\n\n\nMongoDB\nAmazon Aurora\nPGCrypto\nPostGIS\npl/pgsql\nForeign Data Wrappers\nPipelineDB\n\nPodcast Episode\n\n\nCitus\n\nPodcast Episode\n\n\nJepsen Testing\nYugabyte Jepsen Test Results\nOLTP == Online Transaction Processing\nOLAP == Online Analytical Processing\nDocDB\nGoogle Spanner\nGoogle BigTable\nSpot Instances\nKubernetes\nCloudformation\nTerraform\nPrometheus\nDebezium\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The modern era of software development is defined by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide, which in turn requires a new class of data storage that can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet-scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.
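
Because YugabyteDB's YSQL API speaks the PostgreSQL wire protocol, a standard PostgreSQL driver is enough to try it out. The sketch below is illustrative only; the host, port (5433 is the usual YSQL default), database name, credentials, and table are assumptions.

```python
# Sketch of talking to YugabyteDB over its PostgreSQL-compatible YSQL API
# with a standard driver. Connection details and the schema are hypothetical.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5433, dbname="yugabyte", user="yugabyte")

# Create the table outside of an explicit transaction block.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS accounts (
            id BIGSERIAL PRIMARY KEY,
            owner TEXT NOT NULL,
            balance NUMERIC NOT NULL CHECK (balance >= 0)
        )
        """
    )

# A multi-statement transaction: both updates commit atomically or not at
# all, which is the kind of ACID guarantee discussed in the episode.
conn.autocommit = False
with conn, conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE owner = %s", ("alice",))
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE owner = %s", ("bob",))

conn.close()
```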

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about YugabyteDB and how it was architected to power the new generation of planet scale applications","date_published":"2020-01-13T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a3269697-fa6f-44d8-9a9d-67cb9624bea3.mp3","mime_type":"audio/mpeg","size_in_bytes":49429996,"duration_in_seconds":3676}]},{"id":"podlove-2020-01-06t02:58:04+00:00-8c75e1e3c37b13c","title":"Change Data Capture For All Of Your Databases With Debezium","url":"https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114","content_text":"Summary\nDatabases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why it got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful then this episode is for you.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Change Data Capture is and some of the ways that it can be used?\nWhat is Debezium and what problems does it solve?\n\nWhat was your motivation for creating it?\nWhat are some of the use cases that it enables?\nWhat are some of the other options on the market for handling change data capture?\n\n\nCan you describe the systems architecture of Debezium and how it has evolved since it was first created?\n\nHow has the tight coupling with Kafka impacted the direction and capabilities of Debezium?\nWhat, if any, other substrates does Debezium support (e.g. Pulsar, Bookkeeper, Pravega)?\n\n\nWhat are the data sources that are supported by Debezium?\n\nGiven that you have branched into non-relational stores, how have you approached organization of the code to allow for handling the specifics of those engines while retaining a common core set of functionality?\n\n\nWhat is involved in deploying, integrating, and maintaining an installation of Debezium?\n\nWhat are the scaling factors?\nWhat are some of the edge cases that users and operators should be aware of?\n\n\nDebezium handles the ingestion and distribution of database changesets. What are the downstream challenges or complications that application designers or systems architects have to deal with to make use of that information?\n\nWhat are some of the design tensions that exist in the Debezium community between acting as a simple pipe vs. adding functionality for interpreting/aggregating/formatting the information contained in the changesets?\n\n\nWhat are some of the common downstream systems that consume the outputs of Debezium?\n\nWhat challenges or complexities are involved in building clients that can consume the changesets from the different engines that you support?\n\n\nWhat are some of the most interesting, unexpected, or innovative ways that you have seen Debezium used?\nWhat have you found to be the most challenging, complex, or complicated aspects of building, maintaining, and growing Debezium?\nWhat is in store for the future of Debezium?\n\nContact Info\n\nRandall\n\nLinkedIn\n@rhauch on Twitter\nrhauch on GitHub\n\n\nGunnar\n\ngunnarmorling on GitHub\n@gunnarmorling on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDebezium\nConfluent\nKafka Connect\nRedHat\nBean Validation\nChange Data Capture\nDBMS == DataBase Management System\nApache Kafka\nApache Flink\n\nPodcast Episode\n\n\nYugabyte DB\nPostgreSQL\n\nPodcast Episode\n\n\nMySQL\nMicrosoft SQL Server\nApache Pulsar\n\nPodcast Episode\n\n\nPravega\n\nPodcast Episode\n\n\nNATS\nAmazon Kinesis\nPulsar IO\nWePay\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why the project got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or with understanding when it would be useful, then this episode is for you.
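
Debezium connectors typically run inside Kafka Connect and are registered through its REST API. The sketch below shows what that registration can look like for a PostgreSQL source; the hostnames, credentials, table list, and exact property names (which vary between Debezium releases) are assumptions for illustration.

```python
# Sketch of registering a Debezium PostgreSQL connector with the Kafka
# Connect REST API. Connection details and property names are illustrative
# and may differ between Debezium versions.
import json
import requests

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "inventory",
        # Logical name used as the prefix for the Kafka topics that carry
        # the change events for each captured table.
        "database.server.name": "inventory",
        "table.include.list": "public.orders,public.customers",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```

Once the connector is running, each captured table's change events land on their own Kafka topic (prefixed by the logical server name), where downstream consumers such as a warehouse loader or an audit service can subscribe.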

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Debezium framework simplifies implementing change data capture for all of your database engines","date_published":"2020-01-05T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d455b12a-90e8-4033-b761-a2ed1049490e.mp3","mime_type":"audio/mpeg","size_in_bytes":42007019,"duration_in_seconds":3181}]},{"id":"podlove-2019-12-30t13:54:06+00:00-1bbafef6bba9319","title":"Building The DataDog Platform For Processing Timeseries Data At Massive Scale","url":"https://www.dataengineeringpodcast.com/datadog-timeseries-data-episode-113","content_text":"Summary\nDataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nFor anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?\nWhat are the main components of your platform for managing that information?\nHow are the data teams at DataDog organized and what are your primary responsibilities in the organization?\nWhat are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?\n\nWhat are some of the strategies which have proven to be most useful in overcoming those challenges?\n\n\nWho are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met?\nGiven that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?\nMost of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered?\nWhat are some of the upcoming projects that you have planned for the upcoming months and years?\nWhat are some of the technologies, patterns, or practices that you are hoping to adopt?\n\nContact Info\n\nLinkedIn\n@databuryat on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDataDog\nHadoop\nHive\nYarn\nChef\nSRE == Site Reliability Engineer\nApplication Performance Management (APM)\nApache Kafka\nRocksDB\nCassandra\nApache Parquet data serialization format\nSLA == Service Level Agreement\nWatchDog\nApache Spark\n\nPodcast Episode\n\n\nApache Pig\nDatabricks\nJVM == Java Virtual Machine\nKubernetes\nSSIS (SQL Server Integration Services)\nPentaho\nJasperSoft\nApache Airflow\n\nPodcast.__init__ Episode\n\n\nApache NiFi\n\nPodcast Episode\n\n\nLuigi\nDagster\n\nPodcast Episode\n\n\nPrefect\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with a DataDog engineer about how they build reliable and highly available systems for processing timeseries data in real time and at massive scale","date_published":"2019-12-30T15:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/810a11e1-deb6-4a0d-bdf1-193539e289b6.mp3","mime_type":"audio/mpeg","size_in_bytes":32735109,"duration_in_seconds":2754}]},{"id":"podlove-2019-12-23t01:33:53+00:00-a907f5f913970a9","title":"Building The Materialize Engine For Interactive Streaming Analytics In SQL","url":"https://www.dataengineeringpodcast.com/materialize-streaming-analytics-episode-112","content_text":"Summary\nTransactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Materialize is and the problems that you are aiming to solve with it?\n\nWhat was your motivation for creating it?\n\n\nWhat use cases does Materialize enable?\n\nWhat are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?\nHow does it fit into the broader ecosystem of data tools and platforms?\n\n\nWhat are some of the use cases that Materialize is uniquely able to support?\nHow is Materialize architected and how has the design evolved since you first began working on it?\nMaterialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?\n\nWhat are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?\n\n\nIn the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?\n\nA majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?\n\n\nCan you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?\n\nWhat are the considerations and constraints on maintaining the full history of the source data within Materialize?\n\n\nFor someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?\nWhat is the workflow for defining and maintaining a set of views?\n\nWhat are some of the complexities that users might face in ensuring the ongoing functionality of those views?\nFor someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?\n\n\nThe Materialize product is currently pre-release. What are the remaining steps before launching it?\n\nWhat do you have planned for the future of the product and company?\n\n\n\nContact Info\n\nfrankmcsherry on GitHub\n@frankmcsherry on Twitter\nBlog\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMaterialize\nTimely Dataflow\nDryad: Distributed Data-Parallel Programs from Sequential Building Blocks\nNaiad: A Timely Dataflow System\nDifferential Privacy\nPageRank\nData Council Presentation on Materialize\nChange Data Capture\nDebezium\nApache Spark\n\nPodcast Episode\n\n\nFlink\n\nPodcast Episode\n\n\nGo language\nRust\nHaskell\nRust Borrow Checker\nGDB (GNU Debugger)\nAvro\nApache Calcite\nANSI SQL 92\nCorrelated Subqueries\nOOM (Out Of Memory) Killer\nLog-Structured Merge Tree\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An episode about building Materialize for interactive analytics on continuously updated streams of data","date_published":"2019-12-22T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/932ceedf-544b-4214-947a-f9de52f9e9cf.mp3","mime_type":"audio/mpeg","size_in_bytes":36858910,"duration_in_seconds":2887}]},{"id":"podlove-2019-12-16t13:31:34+00:00-e84ebedc856675f","title":"Solving Data Lineage Tracking And Data Discovery At WeWork","url":"https://www.dataengineeringpodcast.com/marquez-data-lineage-episode-111","content_text":"Summary\nBuilding clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject \"Data Engineering Podcast\" to get a hands-on demo from one of their data experts.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. 
For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Marquez is?\n\nWhat was missing in existing metadata management platforms that necessitated the creation of Marquez?\n\n\nHow do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?\n\nHow does it compare to the Amundsen platform that Lyft recently released?\n\n\nWhat are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?\nWhat are some of the capabilities that are unique to Marquez and how are you using them at WeWork?\nWhat are the primary resource types that you support in Marquez?\n\nWhat are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?\n\n\nCan you explain how Marquez is architected and how the design has evolved since you first began working on it?\n\nMany metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?\n\nWhat are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?\n\n\n\n\nHow is the metadata itself stored and managed in Marquez?\n\nHow much up-front data modeling is necessary and what types of schema representations are supported?\n\n\nCan you talk through the overall workflow of someone using Marquez in their environment?\n\nWhat is involved in registering and updating datasets?\nHow do you define and track the health of a given dataset?\nWhat are some of the interesting questions that can be answered from the information stored in Marquez?\n\n\nWhat were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?\nFor someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?\nWhat have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?\nWhen is Marquez the wrong choice for a metadata repository?\nWhat do you have planned for the future of Marquez?\n\nContact Info\n\nJulien Le Dem\n\n@J_ on Twitter\nEmail\njulienledem on GitHub\n\n\nWilly\n\nLinkedIn\n@wslulciuc on Twitter\nwslulciuc on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMarquez\n\nDataEngConf Presentation\n\n\nWeWork\nCanary\nYahoo\nDremio\nHadoop\nPig\nParquet\n\nPodcast Episode\n\n\nAirflow\nApache Atlas\nAmundsen\n\nPodcast Episode\n\n\nUber DataBook\nLinkedIn DataHub\nIceberg Table Format\n\nPodcast Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nGreat Expectations data pipeline unit testing framework\n\nPodcast.__init__ Episode\n\n\nRedshift\nSnowflakeDB\n\nPodcast Episode\n\n\nApache Kafka Schema Registry\n\nPodcast Episode\n\n\nOpen Tracing\nJaeger\nZipkin\nDropWizard Java framework\nMarquez UI\nCayley Graph Database\nKubernetes\nMarquez Helm Chart\nMarquez Docker Container\nDagster\n\nPodcast Episode\n\n\nLuigi\nDBT\n\nPodcast Episode\n\n\nThrift\nProtocol Buffers\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n\n\n","content_html":"

Summary

Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Marquez platform for metadata management powers data lineage tracking, data discovery, and health reporting at WeWork","date_published":"2019-12-16T08:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1819a750-deb8-4f89-93cb-9e955d0cfcb8.mp3","mime_type":"audio/mpeg","size_in_bytes":42906223,"duration_in_seconds":3712}]},{"id":"podlove-2019-12-09t01:48:39+00:00-198edbbd5dd5e28","title":"SnowflakeDB: The Data Warehouse Built For The Cloud","url":"https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110","content_text":"Summary\nData warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCOn US in Pittsburgh. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?\n\nHow does it compare to the other available platforms for data warehousing?\nHow does it differ from traditional data warehouses?\n\nHow does the performance and flexibility affect the data modeling requirements?\n\n\n\n\nSnowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?\nCan you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?\n\nWhat are some of the current limitations that you are struggling with?\n\n\nFor someone getting started with Snowflake what is involved with loading data into the platform?\n\nWhat is their workflow for allocating and scaling compute capacity and running anlyses?\n\n\nOne of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?\nWhat are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?\nWhen is SnowflakeDB the wrong choice?\nWhat are some of the plans for the future of SnowflakeDB?\n\nContact Info\n\nLinkedIn\nWebsite\n@KentGraziano on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nSnowflakeDB\n\nFree Trial\nStack Overflow\n\n\nData Warehouse\nOracle DB\nMPP == Massively Parallel Processing\nShared Nothing Architecture\nMulti-Cluster Shared Data Architecture\nGoogle BigQuery\nAWS Redshift\nAWS Redshift Spectrum\nPresto\n\nPodcast Episode\n\n\nSnowflakeDB Semi-Structured Data Types\nHive\nACID == Atomicity, Consistency, Isolation, Durability\n3rd Normal Form\nData Vault Modeling\nDimensional Modeling\nJSON\nAVRO\nParquet\nSnowflakeDB Virtual Warehouses\nCRM == Customer Relationship Management\nMaster Data Management\n\nPodcast Episode\n\n\nFoundationDB\n\nPodcast Episode\n\n\nApache Spark\n\nPodcast Episode\n\n\nSSIS == SQL Server Integration Services\nTalend\nInformatica\nFivetran\n\nPodcast Episode\n\n\nMatillion\nApache Kafka\nSnowpipe\nSnowflake Data Exchange\nOLTP == Online Transaction Processing\nGeoJSON\nSnowflake Documentation\nSnowAlert\nSplunk\nData Catalog\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how SnowflakeDB was built to provide a performant and flexible data platform for the cloud era","date_published":"2019-12-08T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b6059df0-e15b-44b5-a5be-57a21b9557b2.mp3","mime_type":"audio/mpeg","size_in_bytes":41520874,"duration_in_seconds":3536}]},{"id":"podlove-2019-12-03t03:04:51+00:00-8b474fc51ae7fb3","title":"Organizing And Empowering Data Engineers At Citadel","url":"https://www.dataengineeringpodcast.com/citadel-data-engineering-episode-109","content_text":"Summary\nThe financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data driven culture.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that he and his team are working on at Citadel\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the size and structure of the data engineering teams at Citadel?\n\nHow have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?\n\n\nCan you describe the types of data that you are working with at Citadel?\n\nWhat is the process for identifying, evaluating, and ingesting new sources of data?\n\n\nWhat are some of the common core aspects of your data infrastructure?\n\nWhat are some of the ways that it differs across teams or projects?\n\n\nHow involved are data engineers in the overall product design and delivery lifecycle?\nFor someone who joins your team as a data engineer, what are some of the options available to them for a career path?\nWhat are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?\nWhat are some tools or practices that you are excited to try out?\n\nContact Info\n\nMichael\n\nLinkedIn\n@detroitcoder on Twitter\ndetroitcoder on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nCitadel\nPython\nHedge Fund\nQuantitative Trading\nCitadel Securities\nApache Airflow\nJupyter Hub\nAlembic database migrations for SQLAlchemy\nTerraform\nDQM == Data Quality Management\nGreat Expectations\n\nPodcast.__init__ Episode\n\n\nNomad\nRStudio\nActive Directory\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data driven culture.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about building a successful data team and managing their career growth to power a successful financial business","date_published":"2019-12-02T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/33841f91-45ab-4fb0-ad42-438d2dd74d85.mp3","mime_type":"audio/mpeg","size_in_bytes":34789257,"duration_in_seconds":2750}]},{"id":"podlove-2019-11-26t12:11:31+00:00-175c1abe1baaa10","title":"Building A Real Time Event Data Warehouse For Sentry","url":"https://www.dataengineeringpodcast.com/snuba-event-data-warehouse-episode-108","content_text":"Summary\nThe team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on Clickhouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating Clickhouse.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of Clickhouse\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities?\n\nWhat did the previous system look like?\n\n\nWhat was your design criteria for building a new platform?\n\nWhat was your initial list of possible system components and what was your evaluation process that resulted in your selection of Clickhouse?\n\n\nCan you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work?\n\nWhat have been some of the sharp edges of Clickhouse that you have had to engineer around?\nHow have you found the operational aspects of Clickhouse?\n\n\nHow did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data?\nWhat are some of the downstream benefits of using Clickhouse for managing event data at Sentry?\nFor someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts?\nWhat are some of the other data challenges that you are currently facing at Sentry?\n\nWhat is your next highest priority for evolving or rebuilding to address technical or business challenges?\n\n\n\nContact Info\n\nJames\n\n@JTCunning on Twitter\nJTCunning on GitHub\n\n\nTed\n\ntkaemming on GitHub\nWebsite\n@tkaemming on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSentry\n\nPodcast.__init__ Episode\n\n\nSnuba\n\nBlog Post\n\n\nClickhouse\n\nPodcast Episode\n\n\nDisqus\nUrban Airship\nHBase\nGoogle Bigtable\nPostgreSQL\nRedis\nHyperLogLog\nRiak\nCelery\nRabbitMQ\nApache Spark\nPresto\nCassandra\nApache Kudu\nApache Pinot\nApache Druid\nFlask\nApache Kafka\nCassandra Tombstone\nSentry Blog\nXML\nChange Data Capture\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on Clickhouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating Clickhouse.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how Sentry used Clickhouse to build an event data warehouse and pay down their architecture debt","date_published":"2019-11-26T07:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/59e10ae1-6a3d-4bbb-8bf9-5769618bef6e.mp3","mime_type":"audio/mpeg","size_in_bytes":43964845,"duration_in_seconds":3675}]},{"id":"podlove-2019-11-18t22:18:12+00:00-0b710db63dde324","title":"Escaping Analysis Paralysis For Your Data Platform With Data Virtualization","url":"https://www.dataengineeringpodcast.com/atscale-data-virtualization-episode-107","content_text":"Summary\nWith the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.\nHaving all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. 
They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform that\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the AtScale platform and how it fits in the ecosystem of data tools?\nWhat was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success?\nHow is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?\n\nHow has the surrounding data ecosystem changed since AtScale was founded?\nHow are current industry trends influencing your product focus?\n\n\nCan you talk through the workflow for someone implementing AtScale?\nWhat are some of the main use cases that benefit from data virtualization capabilities?\n\nHow does it influence the relevancy of data warehouses or data lakes?\n\n\nWhat are some of the types of tools or patterns that AtScale replaces in a data platform?\nWhat are some of the most interesting or unexpected ways that you have seen AtScale used?\nWhat have been some of the most challenging aspects of building and growing the platform?\nWhen is AtScale the wrong choice?\nWhat do you have planned for the future of the platform and business?\n\nContact Info\n\nLinkedIn\n@zetty on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nAtScale\nPeopleSoft\nOracle\nHadoop\nPrestoDB\nImpala\nApache Kylin\nApache Druid\nGo Language\nScala\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about data virtualization and data engineering automation with AtScale and the value of abstractions for your data platform architecture","date_published":"2019-11-18T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2363a684-d521-49a7-9f66-4eb9b2b646e9.mp3","mime_type":"audio/mpeg","size_in_bytes":39282858,"duration_in_seconds":3342}]},{"id":"podlove-2019-11-11t17:16:15+00:00-0c65a3202e6bc2d","title":"Designing For Data Protection","url":"https://www.dataengineeringpodcast.com/data-protection-regulations-episode-106","content_text":"Summary\nThe practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protectino and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.\nHaving all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. 
Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what is encompassed by the idea of data protection?\n\nWhat regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?\n\n\nWhat are some of the conflicts and constraints that act against our efforts to implement data protection?\nHow much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?\nCan you give some examples of the types of information that are subject to data protection?\nOne of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?\nA corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?\nWhat are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?\nHow do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?\nWho in the organization is responsible for the proper compliance to GDPR and other data protection regimes?\nDownstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?\n\nContact Info\n\nKaren\n\nEmail\nWebsite\n\n\nMark\n\nEmail\nWebsite\nGDPR Now Podcast\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nData Protection\nGDPR\nThis Is DPO\nIntellectual Property\nEuropean Convention Of Human Rights\nCCPA == California Consumer Privacy Act\nPII == Personally Identifiable Information\nPrivacy By Design\nUS Privacy Shield\nPrinciple of Least Privilege\nInternational Association of Privacy Professionals\n\nPrivacy Technology Vendor Report\n\n\nData Provenance\nChief Data Officer\nUK ICO (Information Commissioner’s Office)\n\nAI Audit Framework\n\n\nData Council\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA, it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about data protection regulations and how they can influence the design of your data platform","date_published":"2019-11-11T17:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/56a2a080-b6b6-4c32-b073-45ec8d95b17c.mp3","mime_type":"audio/mpeg","size_in_bytes":36398038,"duration_in_seconds":3083}]},{"id":"podlove-2019-11-04t13:09:15+00:00-8fff1ff3be9857e","title":"Automating Your Production Dataflows On Spark","url":"https://www.dataengineeringpodcast.com/ascend-dataflow-automation-episode-105","content_text":"Summary\nAs data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation to get an understanding of all of the incidental engineering that is necessary to make your data reliable.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com today to find out more.\nHaving all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. 
Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what the Ascend platform is?\n\nWhat was your inspiration for creating it and what keeps you motivated?\n\n\nWhat was your criteria for determining the best execution substrate for the Ascend platform?\n\nCan you describe any limitations that are imposed by your selection of Spark as the processing engine?\nIf you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?\n\n\nCan you describe the technical implementation of Ascend?\n\nHow has the system design evolved since you first began working on it?\nWhat are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?\n\n\nHow does the programming interface for Ascend differ from that of a vanilla Spark deployment?\n\nWhat are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?\n\n\nHow do you enforce the lack of side effects in the transforms that comprise the dataflow?\nCan you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?\nWhat are some of the most challenging aspects of building and launching Ascend that you have dealt with?\n\nWhat are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?\n\n\nWhat are some of the capabilities that you are most proud of and which have gained the greatest adoption?\nWhat are some of the sharp edges that remain in the platform?\n\nWhen is Ascend the wrong choice?\n\n\nWhat do you have planned for the future of Ascend?\n\nContact Info\n\nLinkedIn\n@seanknapp on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nAscend\nKubernetes\nBigQuery\nApache Spark\nApache Beam\nGo Language\nSHA Hashes\nPySpark\nDelta Lake\n\nPodcast Episode\n\n\nDAG == Directed Acyclic Graph\nPrestoDB\nMinIO\n\nPodcast Episode\n\n\nParquet\nSnappy Compression\nTensorflow\nKafka\nDruid\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the Ascend platform provides an autonomous data orchestration platform to simplify your production dataflows","date_published":"2019-11-04T08:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/62566215-9058-4f12-97a5-251f494d8ac4.mp3","mime_type":"audio/mpeg","size_in_bytes":35222872,"duration_in_seconds":2930}]},{"id":"podlove-2019-10-28t01:14:25+00:00-b5ec9b958bf531c","title":"Build Maintainable And Testable Data Applications With Dagster","url":"https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104","content_text":"Summary\nDespite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Dagster is and the origin story for the project?\nIn the tagline for Dagster you describe it as \"a system for building modern data applications\". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term \"data application\" means, and the types of use cases that Dagster is well suited for?\nCan you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?\n\nWhat do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?\nWhat are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?\n\n\nFor someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?\nFor someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?\n\nOnce they have something working, what is involved in deploying it?\n\n\nOne of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or \"solids\". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?\nAnother difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?\nIt is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?\nCan you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?\nWhat are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?\nYour recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?\nYou have built a new company around Dagster in the form of Elementl. 
How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?\nWhat should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?\n\nWhat is on your roadmap that you consider necessary before creating a 1.0 release?\n\n\n\nContact Info\n\n@schrockn on Twitter\nschrockn on GitHub\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDagster\nElementl\nETL\nGraphQL\nReact\nMatei Zaharia\nDataOps Episode\nKafka\nFivetran\n\nPodcast Episode\n\n\nSpark\nSupervised Learning\nDevOps\nLuigi\nAirflow\nDask\n\nPodcast Episode\n\n\nKubernetes\nRay\nMaxime Beauchemin\n\nPodcast Interview\n\n\nDagster Testing Guide\nGreat Expectations\n\nPodcast.__init__ Interview\n\n\nPapermill\n\nNotebooks At Netflix Episode\n\n\nDBT\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the Dagster framework and how you can use it to build testable and maintainable data applications","date_published":"2019-10-28T09:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bad64c0f-c059-4321-8b65-5ff7bbb7bf14.mp3","mime_type":"audio/mpeg","size_in_bytes":52414461,"duration_in_seconds":4069}]},{"id":"podlove-2019-10-22t01:58:44+00:00-ab9be7f56e2f506","title":"Data Orchestration For Hybrid Cloud Analytics","url":"https://www.dataengineeringpodcast.com/data-orchestration-hybrid-cloud-episode-102","content_text":"Summary\nThe scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. 
Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Dipti Borkark about data orchestration and how it helps in migrating data workloads to the cloud\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what you mean by the term \"Data Orchestration\"?\n\nHow does it compare to the concept of \"Data Virtualization\"?\nWhat are some of the tools and platforms that fit under that umbrella?\n\n\nWhat are some of the motivations for organizations to use the cloud for their data oriented workloads?\n\nWhat are they giving up by using cloud resources in place of on-premises compute?\n\n\nFor businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments?\nWhat are some of the common patterns for cloud migration projects and what challenges do they present?\n\nDo you have advice on useful metrics to track for determining project completion or success criteria?\n\n\nHow do businesses approach employee education for designing and implementing effective systems for achieving their migration goals?\nCan you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?\n\nWhat are some of the common pain points that organizations encounter when working on hybrid implementations?\n\n\nWhat are some of the missing pieces in the data orchestration landscape?\n\nAre there any efforts that you are aware of that are aiming to fill those gaps?\n\n\nWhere is the data orchestration market heading, and what are some industry trends that are driving it?\n\nWhat projects are you most interested in or excited by?\n\n\nFor someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?\n\nContact Info\n\nLinkedIn\n@dborkar on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nAlluxio\n\nPodcast Episode\n\n\nUC San Diego\nCouchbase\nPresto\n\nPodcast Episode\n\n\nSpark SQL\nData Orchestration\nData Virtualization\nPyTorch\n\nPodcast.init Episode\n\n\nRook storage orchestration\nPySpark\nMinIO\n\nPodcast Episode\n\n\nKubernetes\nOpenstack\nHadoop\nHDFS\nParquet Files\n\nPodcast Episode\n\n\nORC Files\nHive Metastore\nIceberg Table Format\n\nPodcast Episode\n\n\nData Orchestration Summit\nStar Schema\nSnowflake Schema\nData Warehouse\nData Lake\nTeradata\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the emerging category of data orchestration platforms and how they can be used to bridge the gap between modern and legacy analytics systems","date_published":"2019-10-21T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f90074fa-4e20-4236-bf16-51fc203d252c.mp3","mime_type":"audio/mpeg","size_in_bytes":32297687,"duration_in_seconds":2571}]},{"id":"podlove-2019-10-14t23:52:54+00:00-c6a033fa1241850","title":"Keeping Your Data Warehouse In Order With DataForm","url":"https://www.dataengineeringpodcast.com/dataform-data-warehouse-management-episode-102","content_text":"Summary\nManaging a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.\nAre you working on data, analytics, or AI using platforms such as Presto, Spark, or Tensorflow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, Tensorflow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! 
Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will takeaway learnings, swag, a free voucher to visit the museum, and a chance to win the latest ipad Pro!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Lewis Hemens about DataForm, a platform that helps analysts manage all data processes in your cloud data warehouse\n\nInterview\n\n\nIntroduction\n\n\nHow did you get involved in the area of data management?\n\n\nCan you start by explaining what DataForm is and the origin story for the platform and company?\n\nWhat are the main benefits of using a tool like DataForm and who are the primary users?\n\n\n\nCan you talk through the workflow for someone using DataForm and highlight the main features that it provides?\n\n\nWhat are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?\n\n\nHow does CI/CD and change management manifest in the context of data warehouse management?\n\n\nHow is the Dataform SDK itself implemented and how has it evolved since you first began working on it?\n\nCan you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?\n\n\n\nWhat was your selection process for an embedded runtime and how did you decide on javascript?\n\nCan you talk through some of the use cases that having an embedded runtime enables?\nWhat are the limitations of SQL when working in a collaborative environment?\n\n\n\nWhich database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?\n\n\nWhat is involved in adding support for a new backend?\n\n\nWhen is DataForm the wrong choice?\n\n\nWhat do you have planned for the future of DataForm?\n\n\nContact Info\n\nLinkedIn\n@lewishemens on Twitter\nlewish on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nDataForm\nYCombinator\nDBT == Data Build Tool\n\nPodcast Episode\n\n\nFishtown Analytics\nTypescript\nContinuous Integration\nContinuous Delivery\nBigQuery\nSnowflake DB\nUDF == User Defined Function\nRedShift\nPostgreSQL\n\nPodcast Episode\n\n\nAWS Athena\nPresto\n\nPodcast Episode\n\n\nApache Beam\nApache Kafka\nSegment\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about Dataform and how it helps you to keep your data warehouse in good working order","date_published":"2019-10-14T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/26e0b188-e058-4920-8bfb-eaa6a0c68c74.mp3","mime_type":"audio/mpeg","size_in_bytes":35892878,"duration_in_seconds":2824}]},{"id":"podlove-2019-10-08t00:31:01+00:00-e400a5e37bc761d","title":"Fast Analytics On Semi-Structured And Structured Data In The Cloud","url":"https://www.dataengineeringpodcast.com/rockset-serverless-analytics-episode-101","content_text":"Summary\nThe process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nThis week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Rockset is and your motivation for creating it?\n\nWhat are some of the use cases that it enables which would otherwise be impractical or intractable?\n\n\nHow does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace?\nCan you describe how the Rockset platform is architected and how it has evolved as you onboard more customers?\nCan you describe the flow of a piece of data as it traverses the full lifecycle in Rockset?\nHow is your storage backend implemented to allow for speed and flexibility in the query layer?\n\nHow does it manage distribution, balancing, and durability of the data?\nWhat are your strategies for handling node and region failure in the cloud?\n\n\nYou have a whitepaper describing your architecture as being oriented around microservices on Kubernetes in order to be cloud agnostic. How do you handle the case where customers have data sources that span multiple cloud providers or regions and the latency that can result?\nHow is the query engine structured to allow for optimizing so many different query types (e.g. search, graph, timeseries, etc.)?\nWith Rockset handling a large portion of the underlying infrastructure work that a data engineer might be involved with, what are some ways that you have seen them use the time that they have gained and how has that benefitted the organizations that they work for?\nWhat are some of the most interesting/unexpected/innovative ways that you have seen Rockset used?\nWhen is Rockset the wrong choice for a given project?\nWhat have you found to be the most challenging and the most exciting aspects of building the Rockset platform and company?\nWhat do you have planned for the future of Rockset?\n\nContact Info\n\nVenkat\n\nLinkedIn\n@iamveeve on Twitter\nveeve on GitHub\n\n\nShruti\n\nLinkedIn\n@shrutibhat on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at pythonpodcast.com/chat\n\nLinks\n\nRockset\n\nBlog\n\n\nOracle\nVMWare\nFacebook\nRube Goldberg Machine\nSnowflakeDB\nProtocol Buffers\nSpark\n\nPodcast Episode\n\n\nPresto\n\nPodcast Episode\n\n\nApache Kafka\nRocksDB\nInnoDB\nLucene\nLog Structured Merge Tree (LSTM)\nKubernetes\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the architecture of Rockset and how they built a serverless platform for fast and flexible analytics on your semi-structured data","date_published":"2019-10-07T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/656200ab-874b-480a-991d-2b2e4b19113a.mp3","mime_type":"audio/mpeg","size_in_bytes":37563102,"duration_in_seconds":3278}]},{"id":"podlove-2019-09-29t10:39:24+00:00-8bb0fa3f1dd0515","title":"Ship Faster With An Opinionated Data Pipeline Framework","url":"https://www.dataengineeringpodcast.com/kedro-data-pipeline-episode-100","content_text":"Summary\nBuilding an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time on gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at Quantum Black for customer projects, and how it can help you structure your own. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scaleable, deployable, robust and versioned data pipelines.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Kedro is and its origin story?\nWho are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?\n\nCan you talk through a typical lifecycle for a project that is built using Kedro?\n\n\nWhat are the overall features of Kedro and how do they compound to encourage best practices for data projects?\nHow does the culture and background of QuantumBlack influence the design and capabilities of Kedro?\n\nWhat was the motivation for releasing it publicly as an open source framework?\n\n\nWhat are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?\nCan you describe how Kedro itself is implemented and how it has evolved since you first started working on it?\nThere has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?\nHow do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?\nIt has not yet reached a stable release. What are the aspects of Kedro that are still in flux and where are the changes most concentrated?\n\nWhat is still missing for a stable 1.x release?\n\n\nWhat are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?\nWhen is Kedro the wrong choice?\nWhat do you have in store for the future of Kedro?\n\nContact Info\n\nLinkedIn\n@tomgoldenberg on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nKedro\n\nGitHub\n\n\nQuantum Black Labs\n\nGitHub\n\n\nAgolo\nMcKinsey\nAirflow\nDocker\nKubernetes\nDataBricks\nFormula 1\nKedro Viz\nDask\n\nPodcast Interview\n\n\nPy.Test\nAzure Data Factory\nPrefect\n\nPodcast Interview\n\n\nDagster\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the open source Kedro framework makes it faster and easier to build your end-to-end data pipeline for machine learning projects","date_published":"2019-09-30T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a080e44d-c8c6-4b90-8de5-c93894d36e37.mp3","mime_type":"audio/mpeg","size_in_bytes":30951744,"duration_in_seconds":2108}]},{"id":"podlove-2019-09-23t02:27:25+00:00-caed5ef25efecc9","title":"Open Source Object Storage For All Of Your Data","url":"https://www.dataengineeringpodcast.com/minio-object-storage-episode-99","content_text":"Summary\nObject storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de-facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you explain what MinIO is and its origin story?\nWhat are some of the main use cases that MinIO enables?\nHow does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?\n\nYour marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)\n\n\nWhat are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?\n\nWhat are the constraints and opportunities that are provided by adhering to that API?\n\n\nCan you describe how MinIO is implemented and the overall system design?\n\nHow has that design evolved since you first began working on it?\n\nWhat assumptions did you have at the outset and how have they been challenged or updated?\n\n\n\n\nWhat are the axes for scaling that MinIO provides and how does it handle clustering?\n\nWhere does it fall on the axes of availability and consistency in the CAP theorem?\n\n\nOne of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilties incur, in terms of computational efficiency and, in a clustered scenario, storage volume?\nFor someone who is interested in running MinIO, what is involved in deploying and maintaining an installation of it?\nWhat are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?\nHow do you approach project governance and sustainability?\nWhat are some of the most interesting/innovative/unexpected ways that you have seen MinIO used?\nWhat do you have planned for the future of MinIO?\n\nContact Info\n\nLinkedIn\n@abperiasamy on Twitter\nabperiasamy on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMinIO\nGlusterFS\nObject Storage\nRedHat\nBionics\nAWS S3\nCeph\nSwift Stack\nPOSIX\nHDFS\nGoogle BigQuery\nAzureML\nAWS SageMaker\nAWS Athena\nS3 Select\nAzure Blob Store\nBackBlaze\nRound Robin DNS\nService Mesh\nIstio\nEnvoy\nSmartStack\nFree Software\nRocksDB\nTanTan Blog Post\nPresto\nSparkML\nMCAdmin Trace\nDTrace\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview on the open source MinIO platform for fast and flexible object storage for data intensive applications and analytics that runs everywhere","date_published":"2019-09-22T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a40d973f-a9d6-4165-be12-fc481a9e985d.mp3","mime_type":"audio/mpeg","size_in_bytes":59278453,"duration_in_seconds":4099}]},{"id":"podlove-2019-09-17t22:46:10+00:00-d2022b8e671c0f3","title":"Navigating Boundless Data Streams With The Swim Kernel","url":"https://www.dataengineeringpodcast.com/swimos-data-streams-episode-98","content_text":"Summary\nThe conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nListen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Swim.ai is and how the project and business got started?\n\nCan you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?\n\n\nWhat are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?\nHow does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?\nCan you describe a typical design for an application or system being built on top of the Swim platform?\n\nWhat does the developer workflow look like?\n\nWhat kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?\n\n\n\n\nCan you describe the internal design for the SwimOS and how it has evolved since you first began working on it?\nFor such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?\n\nWhat mechanisms are in place to account for network failures?\n\n\nSince the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?\nSince there is no explicit data layer, how is data redundancy handled by Swim applications?\nWhat are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?\nWhat have you found to be the most challenging aspects of building the Swim platform?\nWhat are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?\nWhat do you have planned for the future of the technical and business aspects of Swim.ai?\n\nContact Info\n\nLinkedIn\nWikipedia\n@simoncrosby on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nSwim.ai\nHadoop\nStreaming Data\nApache Flink\n\nPodcast Episode\n\n\nApache Kafka\nWallaroo\n\nPodcast Episode\n\n\nDigital Twin\nSwim Concepts Documentation\nRFID == Radio Frequency IDentification\nPCB == Printed Circuit Board\nGraal VM\nAzure IoT Edge Framework\nAzure DLS (Data Lake Storage)\nPower BI\nWARP Protocol\nLightBend\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about using stateful computation on data streams with the SwimOS kernel to improve your analytics","date_published":"2019-09-18T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/9f5436e7-121f-45c7-aa40-893357b037af.mp3","mime_type":"audio/mpeg","size_in_bytes":37907960,"duration_in_seconds":3475}]},{"id":"podlove-2019-09-10t01:18:36+00:00-d4eeca60430d1e0","title":"Building A Reliable And Performant Router For Observability Data","url":"https://www.dataengineeringpodcast.com/vector-observability-data-router-episode-97","content_text":"Summary\nThe first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what the Vector project is and your reason for creating it?\n\nWhat are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?\n\n\nWhat strategy are you using for project governance and sustainability?\nWhat are the main use cases that Vector enables?\nCan you explain how Vector is implemented and how the system design has evolved since you began working on it?\n\nHow did your experience building the business and products for Timber influence and inform your work on Vector?\nWhen you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?\nWhat led you to choose Lua as the embedded scripting environment?\n\n\nWhat data format does Vector use internally?\n\nIs there any support for defining and enforcing schemas?\n\nIn the event of a malformed message is there any capacity for a dead letter queue?\n\n\n\n\nWhat are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?\nWhen designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?\nWhat options are available to operators to support visibility into the running system?\nIn terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?\nWhat are some of the other considerations that operators and administrators of Vector should be considering?\nYou have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?\nWhat is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)\nWhat are some of the most interesting/innovative/unexpected ways that you have seen Vector used?\nWhat are some of the challenges that you have faced in building/publicizing Vector?\nFor someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?\n\nWhat is missing that you would consider necessary for production readiness?\n\n\nWhen is Vector the wrong choice?\n\nContact Info\n\nBen\n\n@binarylogic on Twitter\nbinarylogic on GitHub\n\n\nLuke\n\nLinkedIn\n@lukesteensen on Twitter\nlukesteensen on GitHub\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nVector\n\nGitHub\n\n\nTimber.io\nObservability\nSeatGeek\nApache Kafka\nStatsD\nFluentD\nSplunk\nFilebeat\nLogstash\nFluent Bit\nRust\nTokio Rust library\nTOML\nLua\nNginx\nHAProxy\nWeb Assembly (WASM)\nProtocol Buffers\nJepsen\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about building the Vector project to unify delivery of logs and metrics for better system observability","date_published":"2019-09-09T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e90d66b1-ce56-4d6d-9339-e41854e68767.mp3","mime_type":"audio/mpeg","size_in_bytes":41481329,"duration_in_seconds":3319}]},{"id":"podlove-2019-09-02t16:23:10+00:00-2428a2f0f19ad87","title":"Building A Community For Data Professionals at Data Council","url":"https://www.dataengineeringpodcast.com/data-council-data-professional-community-episode-96","content_text":"Summary\nData professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nListen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was your original reason for focusing your efforts on fostering a community of data engineers?\n\nWhat was the state of recognition in the industry for that role at the time that you began your efforts?\n\n\nThe current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?\n\nHow has the community itself changed and grown over the past few years?\n\n\nCommunities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?\nWhere do you draw inspiration and direction for how to manage such a large and distributed community?\n\nWhat are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?\n\n\nWhat are some ways that you have been surprised or delighted in your interactions with the data community?\nHow do you approach sustainability of the Data Council community and the organization itself?\nThe tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?\nIn addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?\nYou also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?\n\nWhat are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?\n\n\nWhat are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?\n\nWhat are your opinions on venture-backed vs. \"lifestyle\" or bootstrapped businesses?\n\n\nWhat are the characteristics of a data business that you look at when evaluating a potential investment?\nWhat are some of the current industry trends that you are most excited by?\n\nWhat are some that you find concerning?\n\n\nWhat are your goals and plans for the future of Data Council?\n\nContact Info\n\n@petesoder on Twitter\nLinkedIn\n@petesoder on Medium\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nData Council\nDatabase Design For Mere Mortals\nBloomberg\nGarmin\n500 Startups\nGeeks On A Plane\nData Council NYC 2019 Track Summary\nPete’s Angel List Syndicate\nDataOps\n\nData Kitchen Episode\nDataOps Vs DevOps Episode\n\n\nGreat Expectations\n\nPodcast.__init__ Interview\n\n\nElementl\nDagster\n\nData Council Presentation\n\n\nData Council Call For Proposals\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview with Pete Soderling about building and growing the Data Council events and helping engineers build businesses","date_published":"2019-09-02T12:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/2236f2ee-2d57-4615-b200-c7396f256abb.mp3","mime_type":"audio/mpeg","size_in_bytes":40013706,"duration_in_seconds":3166}]},{"id":"podlove-2019-08-26t15:26:15+00:00-ed3d2f40e1ce2e3","title":"Building Tools And Platforms For Data Analytics","url":"https://www.dataengineeringpodcast.com/data-analytics-data-platforms-episode-95","content_text":"Summary\nData engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools and. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Counsil. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nYour host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing some of the main features that you are looking for in the tools that you use?\nWhat are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack?\nWhat should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?\n\nIn terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?\n\n\nIn terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)?\nWhat are some anti-patterns that data engineers can guard against as they are designing their pipelines?\nIn your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with?\nHow much understanding of analytics are necessary for data engineers to be successful in their projects and careers?\n\nConversely, how much understanding of data management should analysts have?\n\n\nWhat are the industry trends that you are most excited by as an analyst?\n\nContact Info\n\nLinkedIn\n@bennstancil on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nMode Analytics\nData Council Presentation\nYammer\nStitchFix Blog Post\nSnowflakeDB\nRe:Dash\nSuperset\nMarquez\nAmundsen\n\nPodcast Episode\n\n\nElementl\nDagster\n\nData Council Presentation\n\n\nDBT\n\nPodcast Episode\n\n\nGreat Expectations\n\nPodcast.__init__ Episode\n\n\nDelta Lake\n\nPodcast Episode\n\n\nStitch\nFivetran\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview on what data engineers need to know about building tools and platforms for data analytics","date_published":"2019-08-26T11:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fcff389d-1a7e-4384-8000-432ffd87eea4.mp3","mime_type":"audio/mpeg","size_in_bytes":38659024,"duration_in_seconds":2886}]},{"id":"podlove-2019-08-19t13:54:22+00:00-35d1b1790adb4c9","title":"A High Performance Platform For The Full Big Data Lifecycle","url":"https://www.dataengineeringpodcast.com/hpcc-big-data-platform-episode-94","content_text":"Summary\nManaging big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nTo connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Counsil. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?\n\nWhat was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?\n\n\nCan you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?\nGiven how long the project has been in use, can you talk about some of the ways that it has had to evolve to accomodate changing trends in usage and technologies for big data and advanced analytics?\nFor someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?\n\nHow does HPCC Systems manage persistence and scalability?\n\n\nWhat are the integration points available for extending and enhancing the HPCC Systems platform?\nWhat is involved in deploying and managing a production installation of HPCC Systems?\nThe ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?\nHow does the Thor engine manage data transformation and manipulation?\n\nWhat are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?\n\n\nFor extraction and analysis of data can you talk through the capabilities of the Roxie engine?\nHow are you using the HPCC Systems platform in your work at LexisNexis?\nDespite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. 
Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?\nHow is the HPCC Systems project governed, and what is your approach to sustainability?\n\nWhat are some of the additional capabilities that are only available in the enterprise distribution?\n\n\nWhen is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?\nWhat have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?\nWhat are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?\nWhat do you have planned for the future of HPCC Systems?\n\nContact Info\n\nLinkedIn\n@fvillanustre on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nHPCC Systems\nLexisNexis Risk Solutions\nRisk Management\nHadoop\nMapReduce\nSybase\nOracle DB\nAbInitio\nData Lake\nSQL\nECL\nDataFlow\nTensorFlow\nECL IDE\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise-grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise-grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well-engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the HPCC Systems platform, its journey to open source, and how it handle the full lifecycle of big data for enterprise scale analytics","date_published":"2019-08-19T14:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/56da6267-16aa-4fb0-b278-9f8c9be90f49.mp3","mime_type":"audio/mpeg","size_in_bytes":58601369,"duration_in_seconds":4425}]},{"id":"podlove-2019-08-12t01:43:36+00:00-b5ac3e15cba559e","title":"Digging Into Data Replication At Fivetran","url":"https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93","content_text":"Summary\nThe extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. 
Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing George Fraser about FiveTran, a hosted platform for replicating your data from source to destination\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the problem that Fivetran solves and the story of how it got started?\nIntegration of multiple data sources (e.g. entity resolution)\nHow is Fivetran architected and how has the overall system design changed since you first began working on it?\nmonitoring and alerting\nAutomated schema normalization. How does it work for customized data sources?\nManaging schema drift while avoiding data loss\nChange data capture\nWhat have you found to be the most complex or challenging data sources to work with reliably?\nWorkflow for users getting started with Fivetran\nWhen is Fivetran the wrong choice for collecting and analyzing your data?\nWhat have you found to be the most challenging aspects of working in the space of data integrations?}}\nWhat have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?\nWhat do you have planned for the future of Fivetran?\n\nContact Info\n\nLinkedIn\n@frasergeorgew on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nFivetran\nRalph Kimball\nDBT (Data Build Tool)\n\nPodcast Interview\n\n\nLooker\n\nPodcast Interview\n\n\nCron\nKubernetes\nPostgres\n\nPodcast Episode\n\n\nOracle DB\nSalesforce\nNetsuite\nMarketo\nJira\nAsana\nCloudwatch\nStackdriver\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Fivetran platform is designed to handle data replication as a service","date_published":"2019-08-12T07:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a58486f9-fa6d-4e74-860f-9a6bab5f3f10.mp3","mime_type":"audio/mpeg","size_in_bytes":33181459,"duration_in_seconds":2680}]},{"id":"podlove-2019-08-05t12:47:42+00:00-9be40bf64893ee1","title":"Solving Data Discovery At Lyft","url":"https://www.dataengineeringpodcast.com/amundsen-data-discovery-episode-92","content_text":"Summary\nData is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has build the Amundsen platform to address the problem of data discovery and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is then give this a listen and then try out Amundsen for yourself.\nAnnouncements\n\nWelcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nFinding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode – the advanced analytics platform that Lyft trusts – has compiled 3 reasons to rethink data discovery. Read them at dataengineeringpodcast.com/mode-lyft.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. 
Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self service data access at Lyft\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Amundsen is and the problems that it was designed to address?\n\nWhat was lacking in the existing projects at the time that led you to building a new platform from the ground up?\n\n\nHow does Amundsen fit in the larger ecosystem of data tools?\n\nHow does it compare to what WeWork is building with Marquez?\n\n\nCan you describe the overall architecture of Amundsen and how it has evolved since you began working on it?\n\nWhat were the main assumptions that you had going into this project and how have they been challenged or updated in the process of building and using it?\n\n\nWhat has been the impact of Amundsen on the workflows of data teams at Lyft?\nCan you talk through an example workflow for someone using Amundsen?\n\nOnce a dataset has been located, how does Amundsen simplify the process of accessing that data for analysis or further processing?\n\n\nHow does the information in Amundsen get populated and what is the process for keeping it up to date?\nWhat was your motivation for releasing it as open source and how much effort was involved in cleaning up the code for the public?\nWhat are some of the capabilities that you have intentionally decided not to implement yet?\nFor someone who wants to run their own instance of Amundsen what is involved in getting it deployed and integrated?\nWhat have you found to be the most challenging aspects of building, using and maintaining Amundsen?\nWhat do you have planned for the future of Amundsen?\n\nContact Info\n\nTao\n\nLinkedIn\nfeng-tao on GitHub\n\n\nMark\n\nLinkedIn\nWebsite\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAmundsen\n\nData Council Presentation\nStrata Presentation\nBlog Post\n\n\nLyft\nAirflow\n\nPodcast.__init__ Episode\n\n\nLinkedIn\nSlack\nMarquez\nS3\nHive\nPresto\n\nPodcast Episode\n\n\nSpark\nPostgreSQL\nGoogle BigQuery\nNeo4J\nApache Atlas\nTableau\nSuperset\nAlation\nCloudera Navigator\nDynamoDB\nMongoDB\nDruid\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has built the Amundsen platform to address the problem of data discovery and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is then give this a listen and then try out Amundsen for yourself.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the open source Amundsen platform for data discovery and how Lyft is using it to improve their analytics workflow","date_published":"2019-08-05T09:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7bf4d624-ee62-449a-8240-a2a277ee7970.mp3","mime_type":"audio/mpeg","size_in_bytes":32908856,"duration_in_seconds":3108}]},{"id":"podlove-2019-07-28t01:25:10+00:00-ef12d25ad9ad6a9","title":"Simplifying Data Integration Through Eventual Connectivity","url":"https://www.dataengineeringpodcast.com/eventual-connectivity-data-integration-episode-91","content_text":"Summary\nThe ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nTo connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. 
Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?\nWhat is eventual connectivity and how does it address the problems with ETL in the current data landscape?\nIn your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?\n\nHow do different implementations of graph databases impact their viability for this use case?\n\n\nCan you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?\nHow much up-front modeling is necessary to make this a viable approach to data integration?\nHow do the volume and format of the source data impact the technology and architecture decisions that you would make?\nWhat are the limitations or edge cases that you have found when using this pattern?\nIn modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?\nWhat resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?\n\nContact Info\n\nEmail\nLinkedIn\n@jerrong on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nEventual Connectivity White Paper\nCluedIn\n\nPodcast Episode\n\n\nCopenhagen\nEwok\nMultivariate Testing\nCRM\nERP\nETL\nELT\nDAG\nGraph Database\nApache NiFi\n\nPodcast Episode\n\n\nApache Airflow\n\nPodcast.init Episode\n\n\nBigQuery\nRedShift\nCosmosDB\nSAP HANA\nIOT == Internet of Things\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about a new pattern for data integration that reduces the amount of effort required to find connections in numerous data sets","date_published":"2019-07-28T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/63e828b6-6a77-4e5c-b0d4-aa10a6c86450.mp3","mime_type":"audio/mpeg","size_in_bytes":36399064,"duration_in_seconds":3227}]},{"id":"podlove-2019-07-22t22:02:59+00:00-98fe0560ffae1d8","title":"Straining Your Data Lake Through A Data Mesh","url":"https://www.dataengineeringpodcast.com/zhamak-dehghani-data-mesh-episode-90","content_text":"Summary\nThe current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAnd to grow your professional network and find opportunities with the startups that are changing the world then Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. 
Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by providing your definition of a \"data lake\" and discussing some of the problems and challenges that they pose?\n\nWhat are some of the organizational and industry trends that tend to lead to this solution?\n\n\nYou have written a detailed post outlining the concept of a \"data mesh\" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?\n\nIn a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?\n\n\nWhat are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?\nOne of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?\n\nA corollary to the issue of discovery is that of access and governance. What are some strategies to making that scalable and maintainable across different data products within an organization?\n\nWho is responsible for implementing and enforcing compliance regimes?\n\n\n\n\nOne of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. 
What has been your experience in that regard?\n\nHow do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?\n\nHas latency of data retrieval proven to be an issue in your work?\n\n\n\n\nWhen it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?\n\nHow do team structures and dynamics shift in this scenario?\nWhat are the necessary skills for each team?\n\n\nWho is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?\nIs there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?\nFor an organization that has an existing monolothic architecture, how do you suggest they approach decomposing their data into separately managed domains?\nAre there any other architectural considerations that data professionals should be considering that aren’t yet widespread?\n\nContact Info\n\nLinkedIn\n@zhamakd on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nHow to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh\nThoughtworks\nTechnology Radar\nData Lake\nData Warehouse\nJames Dixon\nAzure Data Lake\n\"Big Ball Of Mud\" Anti-Pattern\nETL\nELT\nHadoop\nSpark\nKafka\nEvent Sourcing\nAirflow\n\nPodcast.__init__ Episode\nData Engineering Episode\n\n\nData Catalog\nMaster Data Management\n\nPodcast Episode\n\n\nPolyseme\nREST\nCNCF (Cloud Native Computing Foundation)\nCloud Events Standard\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the data mesh architectural and organizational pattern can lead to a more maintainable data platform","date_published":"2019-07-22T18:15:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7f02f3a6-37ea-4373-8701-904dc308b33d.mp3","mime_type":"audio/mpeg","size_in_bytes":49993652,"duration_in_seconds":3867}]},{"id":"podlove-2019-07-15t00:57:06+00:00-296598634b6216c","title":"Data Labeling That You Can Feel Good About With CloudFactory","url":"https://www.dataengineeringpodcast.com/cloudfactory-data-labeling-episode-89","content_text":"Summary\nSuccessful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nIntegrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. 
Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Mark Sears about Cloud Factory, masters of the art and science of labeling data for Machine Learning and more\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what CloudFactory is and the story behind it?\nWhat are some of the common requirements for feature extraction and data labelling that your customers contact you for?\nWhat integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?\nCan you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?\n\nWhat protocols do you have in place to ensure data quality and identify potential sources of bias?\n\n\nWhat role do humans play in the lifecycle for AI and ML projects?\nI understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?\n\nHow do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?\n\n\nCan you share some stories of cloud workers who have benefited from their experience working with your company?\nWhat are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?\nWhat have been some of the most interesting/unexpected ways that you have seen customers using your platform?\nWhat lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?\nWhat are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?\n\nHow does that tie into your plans for CloudFactory in the medium to long term?\n\n\n\nContact Info\n\n@marktsears on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCloudFactory\nReading, UK\nNepal\nKenya\nRuby on Rails\nKathmandu\nNatural Language Processing (NLP)\nComputer Vision\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about the Cloud Factory platform for data labeling and social good in developing nations","date_published":"2019-07-14T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d548db7e-74ac-475b-975b-aa9c4dd7a871.mp3","mime_type":"audio/mpeg","size_in_bytes":45402396,"duration_in_seconds":3470}]},{"id":"podlove-2019-07-08t11:46:06+00:00-b311e1ee1848576","title":"Scale Your Analytics On The Clickhouse Data Warehouse","url":"https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88","content_text":"Summary\nThe market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nIntegrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. 
Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Clickhouse is and how you each got involved with it?\n\nWhat are the primary use cases that Clickhouse is targeting?\nWhere does it fit in the database market and how does it compare to other column stores, both open source and commercial?\n\n\nCan you describe how Clickhouse is architected?\nCan you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine and storage layer, and then the lookup process at query time?\n\nI noticed that Clickhouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for Clickhouse?\n\n\nAside from directly inserting a record via the client APIs can you talk through the options for loading data into Clickhouse?\n\nFor the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to Clickhouse?\n\n\nWhat are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to Clickhouse?\nFor someone getting started with Clickhouse can you describe how they should be thinking about data modeling?\nRecent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does Clickhouse fall along that spectrum?\nHow is scaling in Clickhouse implemented and what are the edge cases that users should be aware of?\n\nHow is data replication and consistency managed?\n\n\nWhat is involved in deploying and maintaining an installation of Clickhouse?\n\nI noticed that Altinity is providing a Kubernetes operator for Clickhouse. 
What are the opportunities and tradeoffs presented by that platform for Clickhouse?\n\n\nWhat are some of the most interesting/unexpected/innovative ways that you have seen Clickhouse used?\nWhat are some of the most challenging aspects of working on Clickhouse itself, and or implementing systems on top of it?\nWhat are the shortcomings of Clickhouse and how do you address them at Altinity?\nWhen is Clickhouse the wrong choice?\n\nContact Info\n\nRobert\n\nLinkedIn\nhodgesrm on GitHub\n\n\nAlexander\n\nalex-zaitsev on GitHub\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nClickhouse\nAltinity\nOLAP\nM204\nSybase\nMySQL\nVertica\nYandex\nYandex Metrica\nGoogle Analytics\nSQL\nGreenplum\nInfoBright\nInfiniDB\nMariaDB\nSpark\nSIMD (Single Instruction, Multiple Data)\nMergesort\nETL\nChange Data Capture\nMapReduce\nKDB\nOLTP\nCassandra\nInfluxDB\nPrometheus\nSnowflakeDB\nHive\nHadoop\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about Clickhouse, an open source, columnar data warehouse built for massive scale and speed to enable interactive analytics","date_published":"2019-07-08T11:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6673776d-9c0e-463b-a4bd-a4de7cfb1643.mp3","mime_type":"audio/mpeg","size_in_bytes":53183908,"duration_in_seconds":4278}]},{"id":"podlove-2019-07-02t01:05:24+00:00-0be4cc36e3e06d2","title":"Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection","url":"https://www.dataengineeringpodcast.com/instaclustr-kafka-cassandra-scaling-episode-87","content_text":"Summary\nAnomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nIntegrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. 
The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing the problem that you were trying to solve and the requirements that you were aiming for?\n\nWhat are some example cases where anomaly detection is useful or necessary?\n\n\nOnce you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?\nWhat was your selection criteria for the various components of your system design?\n\nWhat tools and technologies did you consider in your initial assessment and which did you ultimately converge on?\n\nIf you were to start over today would you do any of it differently?\n\n\n\n\nCan you talk through the algorithm that you used for detecting anomalous activity?\n\nWhat is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?\n\n\nWhat were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?\nWhat were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?\n\nHow did those bottlenecks differ as you moved through different levels of scale?\n\n\nWhat were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?\nWhat were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?\nHow have those lessons fed back to your work at Instaclustr?\n\nContact Info\n\nLinkedIn\n@paulbrebner_ on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nInstaclustr\nKafka\nCassandra\nCanberra, Australia\nSpark\nAnomaly Detection\nKubernetes\nPrometheus\nOpenTracing\nJaeger\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about testing the limits of scaling Kafka and Cassandra for real-time anomaly detection at Instaclustr","date_published":"2019-07-01T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/745d8432-b083-421b-90f9-b8e8558f1e23.mp3","mime_type":"audio/mpeg","size_in_bytes":27104343,"duration_in_seconds":2282}]},{"id":"podlove-2019-06-25t01:12:44+00:00-fafdfbe8c704a47","title":"The Workflow Engine For Data Engineers And Data Scientists","url":"https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86","content_text":"Summary\nBuilding a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. 
Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Prefect is and your motivation for creating it?\nWhat are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?\nIn some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?\nHow is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?\nHow do you manage passing data between stages in a pipeline when they are running across distributed nodes?\nWhat was your decision making process when deciding to use Dask as your supported execution engine?\n\nFor tasks that require specific resources or dependencies how do you approach the idea of task affinity?\n\n\nDoes Prefect support managing tasks that bridge network boundaries?\nWhat are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?\nWhat are the limitations of the open source core as compared to the cloud offering that you are building?\nWhat were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?\nWhat are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?\nWhen is Prefect the wrong choice?\nIn your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?\n\nWhat are some best practices and industry trends that you are most excited by?\n\n\nWhat do you have planned for the future of the Prefect project and company?\n\nContact Info\n\nLinkedIn\n@jlowin on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPrefect\nAirflow\nDask\n\nPodcast Episode\n\n\nPrefect Blog\nPyData Presentation\nTensorflow\nWorkflow Engine\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the Prefect workflow engine unifies the needs of data engineers and data scientists with a pure Python API","date_published":"2019-06-24T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8724b84d-434b-470e-afae-ba7c76c09633.mp3","mime_type":"audio/mpeg","size_in_bytes":48664028,"duration_in_seconds":4106}]},{"id":"podlove-2019-06-15t12:41:38+00:00-7aafea604b3a4d4","title":"Maintaining Your Data Lake At Scale With Spark","url":"https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85","content_text":"Summary\nBuilding and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at DataBricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAnd to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Delta Lake is and the motivation for creating it?\nWhat are some of the common antipatterns in data lake implementations and how does Delta Lake address them?\n\nWhat are the benefits of a data lake over a data warehouse?\n\nHow has that equation changed in recent years with the availability of modern cloud data warehouses?\n\n\n\n\nHow is Delta lake implemented and how has the design evolved since you first began working on it?\n\nWhat assumptions did you have going into the project and how have they been challenged as it has gained users?\n\n\nOne of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?\n\nIn your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)\n\n\nCan you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)\n\nAre there limits in terms of the volume of data that can be managed within a single transaction?\n\n\nHow does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?\n\nThe Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?\n\n\nWhat have been the most difficult/complex/challenging aspects of building Delta Lake?\nHow is the data versioning in Delta Lake implemented?\n\nBy keeping a copy of all iterations of a data set there is the opportunity for a great deal of additional cost. 
What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?\n\n\nWhat are the reasons for standardizing on Parquet as the storage format?\n\nWhat are some of the cases where that has led to greater complications?\n\n\nIn addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?\nWhen is Delta Lake the wrong choice?\n\nWhat problems did you consciously decide not to address?\n\n\nWhat is in store for the future of Delta Lake?\n\nContact Info\n\nLinkedIn\n@michaelarmbrust on Twitter\nmarmbrus on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDelta Lake\nDataBricks\nSpark SQL\nMicrosoft SQL Server\nDatabricks Delta\nSpark Summit\nApache Spark\nEnterprise Data Curation Episode\nData Lake\nData Warehouse\nSnowflakeDB\nBigQuery\nParquet\n\nData Serialization Episode\n\n\nHive Metastore\nGreat Expectations\n\nPodcast.__init__ Interview\n\n\nOptimistic Concurrency/Optimistic Locking\nPresto\nStarburst Labs\n\nPodcast Interview\n\n\nApache NiFi\n\nPodcast Interview\n\n\nTensorflow\nTableau\nChange Data Capture\nApache Pulsar\n\nPodcast Interview\n\n\nPravega\n\nPodcast Interview\n\n\nMulti-Version Concurrency Control\nMLFlow\nAvro\nORC\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at DataBricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A conversation with the architect of Delta Lake on the challenges of building a sustainable data lake at scale","date_published":"2019-06-16T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/83d4dda1-8b35-4a7c-9093-a8bdb3469415.mp3","mime_type":"audio/mpeg","size_in_bytes":39376973,"duration_in_seconds":3050}]},{"id":"podlove-2019-06-08t22:33:03+00:00-71d353ff0e9c3d7","title":"Managing The Machine Learning Lifecycle","url":"https://www.dataengineeringpodcast.com/hydrosphere-machine-learning-lifecycle-episode-84","content_text":"Summary\nBuilding a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAnd to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. 
The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Hydrosphere is and share its origin story?\nIn your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?\n\nHow does it differ from deployment and maintenance of a regular software application?\n\n\nCan you describe how Hydrosphere is architected and how the different components of the stack fit together?\nFor someone who is using Hydrosphere in their production workflow, what would that look like?\n\nWhat is the difference in interaction with Hydrosphere for different roles within a data team?\n\n\nWhat are some of the types of metrics that you monitor to determine when and how to retrain deployed models?\n\nWhich metrics do you track for testing and verifying the health of the data?\n\n\nWhat are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?\nHow has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?\n\nHow has that influenced the design and direction of Hydrosphere, both as a project and a business?\nHow has the design of Hydrosphere evolved since you first began working on it?\n\n\nWhat assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?\nWhat have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?\nWhat do you have in store for the future of Hydrosphere?\n\nContact Info\n\nLinkedIn\nspushkarev on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nHydrosphere\n\nGitHub\n\n\nData Engineering Podcast at ODSC\nKD Nuggets\n\nBig Data Science: Expectation vs. Reality\n\n\nThe Open Data Science Conference\nScala\nInfluxDB\nRocksDB\nDocker\nKubernetes\nAkka\nPython Pickle\nProtocol Buffers\nKubeflow\nMLFlow\nTensorFlow Extended\nKubeflow Pipelines\nArgo\nAirflow\n\nPodcast.__init__ Interview\n\n\nEnvoy\nIstio\nDVC\n\nPodcast.__init__ Interview\n\n\nGenerative Adversarial Networks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how the open source Hydrosphere platform simplifies management of the full machine learning lifecycle","date_published":"2019-06-09T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/94ec7e9e-1015-4038-a809-43e8b028f0a7.mp3","mime_type":"audio/mpeg","size_in_bytes":45980463,"duration_in_seconds":3759}]},{"id":"podlove-2019-06-04t04:50:50+00:00-084c30d0fe85373","title":"Evolving An ETL Pipeline For Better Productivity","url":"https://www.dataengineeringpodcast.com/greenhouse-etl-pipeline-episode-83","content_text":"Summary\nBuilding an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allows his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service this is definitely worth listening to for some perspective.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAnd to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!\nYou listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. 
Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nAaron, can you start by describing what Greenhouse is and some of the ways that you use data?\nCan you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?\n\nWhat are your primary sources of data and what are the targets that you are loading them into?\n\n\nWhat were your biggest pain points and what motivated you to re-evaluate your approach to ETL?\n\nWhat were your criteria for your replacement technology and how did you gather and evaluate your options?\n\n\nOnce you made the decision to use DataCoral can you talk through the transition and cut-over process?\n\nWhat were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?\nWhat were the big wins?\n\n\nWhat was your evaluation framework for determining whether your re-engineering was successful?\nNow that you are using DataCoral how would you characterize the experiences of yourself and your team?\n\nIf you have freed up time for your engineers, how are you allocating that spare capacity?\n\n\nWhat do you hope to see from DataCoral in the future?\nWhat advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?\n\nContact Info\n\nAaron\n\nagribralter on GitHub\nLinkedIn\n\n\nRaghu\n\nLinkedIn\nMedium\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nGreenhouse\n\nWe’re hiring Data Scientists and Software Engineers!\n\n\nDatacoral\nAirflow\n\nPodcast.init Interview\nData Engineering Interview about running Airflow in production\n\n\nPeriscope Data\nMode Analytics\nData Warehouse\nETL\nSalesforce\nZendesk\nJira\nDataDog\nAsana\nGDPR\nMetabase\n\nPodcast Interview\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allows his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service this is definitely worth listening to for some perspective.

Announcements

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An interview about how and why Greenhouse migrated their homegrown ETL pipeline onto DataCoral","date_published":"2019-06-04T01:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0edf945c-1544-4f5a-aa3d-d31e2b9bb83c.mp3","mime_type":"audio/mpeg","size_in_bytes":46706994,"duration_in_seconds":3741}]},{"id":"podlove-2019-05-27t02:15:19+00:00-62fe8ba2507c6ce","title":"Data Lineage For Your Pipelines","url":"https://www.dataengineeringpodcast.com/pachyderm-data-lineage-episode-82","content_text":"Summary\nSome problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nUnderstanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. 
Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Pachyderm is and how it got started?\n\nWhat is new in the last two years since I talked to Dan Whitenack in episode 1?\nHow have the changes and additional features in Kubernetes impacted your work on Pachyderm?\n\n\nA recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?\nCan you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?\n\nHow does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?\n\n\nThere are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?\nData provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?\n\nWhat is the interface for exposing and exploring that provenance data?\n\n\nWhat are some of the advanced capabilities of Pachyderm that you would like to call out?\nWith your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. 
How are you approaching that and what are some of the challenges you are facing?\nWhat have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?\nWhat do you have planned for the future of Pachyderm?\n\nContact Info\n\n@jdoliner on Twitter\nLinkedIn\njdoliner on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPachyderm\nRethinkDB\nAirBnB\nData Provenance\nKubeflow\nStateful Sets\nEtcD\nAirflow\nKafka\nGitHub\nGitLab\nDocker\nKubernetes\nCI == Continuous Integration\nCD == Continuous Delivery\nCeph\n\nPodcast Interview\n\n\nObject Storage\nMiniKube\nFUSE == File System In User Space\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how the open source Pachdyerm platform makes building flexible data pipelines with first class support for data lineage easy","date_published":"2019-05-26T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a40f4615-24a5-44b8-bb30-fc07c286ef09.mp3","mime_type":"audio/mpeg","size_in_bytes":33793559,"duration_in_seconds":2941}]},{"id":"podlove-2019-05-20t02:37:26+00:00-326b6593e40dc0e","title":"Build Your Data Analytics Like An Engineer With DBT","url":"https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81","content_text":"Summary\nIn recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nUnderstanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what DBT is and your motivation for creating it?\nWhere does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?\nCan you talk through the workflow for someone using DBT?\nOne of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?\nThe packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?\n\nAre these packages driven by Fishtown Analytics or the dbt community?\n\n\nWhat are the limitations of modeling everything as a SELECT statement?\nMaking SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?\n\nWhat are your thoughts on higher level approaches to SQL that compile down to the specific statements?\n\n\nCan you explain how DBT is implemented and how the design has evolved since you first began working on it?\nWhat are some of the features of DBT that are often overlooked which you find particularly useful?\nWhat are some of the most interesting/unexpected/innovative ways that you have seen DBT used?\nWhat are the additional features that the commercial version of DBT provides?\nWhat are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?\nWhen is it the wrong choice?\nWhat do you have planned for the future of DBT?\n\nContact Info\n\nEmail\n@drebanin on Twitter\ndrebanin on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDBT\nFishtown Analytics\n8Tracks Internet Radio\nRedshift\nMagento\nStitch Data\nFivetran\nAirflow\nBusiness Intelligence\nJinja template language\nBigQuery\nSnowflake\nVersion Control\nGit\nContinuous Integration\nTest Driven Development\nSnowplow Analytics\n\nPodcast Episode\n\n\ndbt-utils\nWe Can Do Better Than SQL blog post from EdgeDB\nEdgeDB\nLooker LookML\n\nPodcast Interview\n\n\nPresto DB\n\nPodcast Interview\n\n\nSpark SQL\nHive\nAzure SQL Data Warehouse\nData Warehouse\nData Lake\nData Council Conference\nSlowly Changing Dimensions\ndbt Archival\nMode Analytics\nPeriscope BI\ndbt docs\ndbt repository\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how dbt enables your data teams to build better analytics in your data warehouse","date_published":"2019-05-19T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f7b149b1-5584-4055-8ecb-8f4cf7d2e4ca.mp3","mime_type":"audio/mpeg","size_in_bytes":40657751,"duration_in_seconds":3406}]},{"id":"podlove-2019-05-06t23:57:02+00:00-827d45f1ac7c9bd","title":"Using FoundationDB As The Bedrock For Your Distributed Systems","url":"https://www.dataengineeringpodcast.com/foundationdb-distributed-systems-episode-80","content_text":"Summary\nThe database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nUnderstanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. 
On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key/value store that gives you the power of ACID transactions in a NoSQL database\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you explain what FoundationDB is and how you got involved with the project?\nWhat are some of the unique use cases that FoundationDB enables?\nCan you describe how FoundationDB is architected?\n\nHow is the ACID compliance implemented at the cluster level?\n\n\nWhat are some of the mechanisms built into FoundationDB that contribute to its fault tolerance?\n\nHow are conflicts managed?\n\n\nFoundationDB has an interesting feature in the form of Layers that provide different semantics on the underlying storage. Can you describe how that is implemented and some of the interesting layers that are available?\n\nIs it possible to apply different layers, such as relational and document, to the same underlying objects in storage?\n\n\nOne of the aspects of FoundationDB that is called out in the documentation and which I have heard about elsewhere is the performance that it provides. 
Can you describe some of the implementation mechanics of FoundationDB that allow it to provide such high throughput?\nFor someone who wants to run FoundationDB can you describe a typical deployment topology?\n\nWhat are the scaling factors for the underlying storage and for the Layers that are operating on the cluster?\n\n\nOnce you have a cluster deployed, what are some of the edge cases that users should watch out for?\n\nHow are version upgrades managed in a cluster?\n\n\nWhat are some of the ways that FoundationDB impacts the way that an application developer or data engineer would architect their software as compared to working with something like Postgres or MongoDB?\nWhat are some of the more interesting/unusual/unexpected ways that you have seen FoundationDB used?\nWhen is FoundationDB the wrong choice?\nWhat is in store for the future of FoundationDB?\n\nContact Info\n\nLinkedIn\n@ryanworl on Twitter\nWebsite\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nFoundationDB\nJepsen\nAndy Pavlo\nArchive.org – The Internet Archive\nFoundationDB Summit\nFlow Language\nC++\nActor Model\nErlang\nZookeeper\n\nPodcast Episode\n\n\nPAXOS consensus algorithm\nMulti-Version Concurrency Control (MVCC) AKA Optimistic Locking\nACID\nCAP Theorem\nRedis\nRecord Layer\nCloudKit\nDocument Layer\nSegment\n\nPodcast Episode\n\n\nNVMe\nSnowflakeDB\nFlatBuffers\nProtocol Buffers\nRyan Worl FoundationDB Summit Presentation\nGoogle F1\nGoogle Spanner\nWaveFront\nEtcD\nB+ Tree\nMichael Stonebraker\nThree Vs\nConfluent\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the FoundationDB project and how it simplifies the work of building custom distributed systems applications","date_published":"2019-05-06T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/84ff35c9-128f-4966-a3a2-e57a8a59ebec.mp3","mime_type":"audio/mpeg","size_in_bytes":44579412,"duration_in_seconds":3962}]},{"id":"podlove-2019-04-28t14:22:03+00:00-9f70737a6bfbe7c","title":"Running Your Database On Kubernetes With KubeDB","url":"https://www.dataengineeringpodcast.com/kubedb-kubernetes-database-episode-79","content_text":"Summary\nKubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage system in the same platform as your application. In this episode Tamal Saha explains how the KubeDB project got started, why you might want to run your database with Kubernetes, and how to get started. He also covers some of the challenges of managing stateful services in Kubernetes and how the fast pace of the community has contributed to the evolution of KubeDB. If you are at any stage of a Kubernetes implementation, or just thinking about it, this is definitely worth a listen to get some perspective on how to leverage it for your entire application stack.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nUnderstanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. 
Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Tamal Saha about KubeDB, a project focused on making running production-grade databases easy on Kubernetes\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what KubeDB is and how the project got started?\nWhat are the main challenges associated with running a stateful system on top of Kubernetes?\n\nWhy would someone want to run their database on a container platform rather than on a dedicated instance or with a hosted service?\n\n\nCan you describe how KubeDB is implemented and how that has evolved since you first started working on it?\nCan you talk through how KubeDB simplifies the process of deploying and maintaining databases?\nWhat is involved in adding support for a new database?\n\nHow do the requirements change for systems that are natively clustered?\n\n\nHow does KubeDB help with maintenance processes around upgrading existing databases to newer versions?\nHow does the work that you are doing on KubeDB compare to what is available in StorageOS?\n\nAre there any other projects that are targeting similar goals?\n\n\nWhat have you found to be the most interesting/challenging/unexpected aspects of building KubeDB?\nWhat do you have planned for the future of the project?\n\nContact Info\n\nLinkedIn\n@tsaha on Twitter\nEmail\ntamalsaha on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nKubeDB\nAppsCode\nKubernetes\nKubernetes CRD (Custom Resource Definition)\nKubernetes Operator\nKubernetes Stateful Sets\nPostgreSQL\n\nPodcast Interview\n\n\nHashicorp Vault\nRedis\nElasticsearch\n\nPodcast Interview\n\n\nMySQL\nMemcached\nMongoDB\nDocker\nRook Storage Orchestration for Kubernetes\nCeph\n\nPodcast Interview\n\n\nEBS\nStorageOS\nGlusterFS\nOpenEBS\nCloudFoundry\nAppsCode Service Broker\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how to run your database on Kubernetes with the creator of KubeDB","date_published":"2019-04-28T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/91282043-600c-49d7-9ad8-218fe0f7980d.mp3","mime_type":"audio/mpeg","size_in_bytes":23645555,"duration_in_seconds":3054}]},{"id":"podlove-2019-04-22t13:58:13+00:00-70073c5f3064bd5","title":"Unpacking Fauna: A Global Scale Cloud Native Database","url":"https://www.dataengineeringpodcast.com/fauna-cloud-native-database-episode-78","content_text":"Summary\nOne of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nUnderstanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. 
Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what FaunaDB is and how it got started?\nWhat are some of the main use cases that FaunaDB is targeting?\n\nHow does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?\n\n\nCan you describe the architecture of FaunaDB and how it has evolved?\nThe consensus and replication protocol in Fauna is intriguing. 
Can you talk through how it works?\n\nWhat are some of the edge cases that users should be aware of?\nHow are conflicts managed in Fauna?\n\n\nWhat is the underlying storage layer?\n\nHow is the query layer designed to allow for different query patterns and model representations?\n\n\nHow does data modeling in Fauna compare to that of relational or document databases?\n\nCan you describe the query format?\nWhat are some of the common difficulties or points of confusion around interacting with data in Fauna?\n\n\nWhat are some application design patterns that are enabled by using Fauna as the storage layer?\nGiven the ability to replicate globally, how do you mitigate latency when interacting with the database?\nWhat are some of the most interesting or unexpected ways that you have seen Fauna used?\nWhen is it the wrong choice?\nWhat have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?\nWhat do you have in store for the future of Fauna?\n\nContact Info\n\n@evan on Twitter\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nFauna\nRuby on Rails\nCNET\nGitHub\nTwitter\nNoSQL\nCassandra\nInnoDB\nRedis\nMemcached\nTimeseries\nSpanner Paper\nDynamoDB Paper\nPercolator\nACID\nCalvin Protocol\nDaniel Abadi\nLINQ\nLSM Tree (Log-structured Merge-tree)\nScala\nChange Data Capture\nGraphQL\n\nPodcast.init Interview About Graphene\n\n\nFauna Query Language (FQL)\nCQL == Cassandra Query Language\nObject-Relational Databases\nLDAP == Lightweight Directory Access Protocol\nAuth0\nOLAP == Online Analytical Processing\nJepsen distributed systems safety research\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"A deep dive on building the Fauna database and how it supports transactions at global scale","date_published":"2019-04-22T14:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/aca66046-3733-4364-bb9a-b0b266e6f27f.mp3","mime_type":"audio/mpeg","size_in_bytes":35069925,"duration_in_seconds":3230}]},{"id":"podlove-2019-04-15t13:15:20+00:00-f0dd4f4ca704840","title":"Index Your Big Data With Pilosa For Faster Analytics","url":"https://www.dataengineeringpodcast.com/pilosa-database-index-episode-77","content_text":"Summary\nDatabase indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nUnderstanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. 
On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Pilosa is and how the project got started?\nWhere does Pilosa fit into the overall data ecosystem and how does it integrate into an existing stack?\nWhat types of use cases is Pilosa uniquely well suited for?\nThe Pilosa data model is fairly unique. Can you talk through how it is represented and implemented?\nWhat are some approaches to modeling data that might be coming from a relational database or some structured flat files?\n\nHow do you handle highly dimensional data?\n\n\nWhat are some of the decisions that need to be made early in the modeling process which could have ramifications later on in the lifecycle of the project?\nWhat are the scaling factors of Pilosa?\nWhat are some of the most interesting/challenging/unexpected lessons that you have learned in the process of building Pilosa?\nWhat is in store for the future of Pilosa?\n\nContact Info\n\nPilosa\n\nWebsite\nEmail\n@slothware on Twitter\n\n\nSeebs\n\nseebs on GitHub\nWebsite\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPQL (Pilosa Query Language)\nRoaring Bitmap\nWhitepaper\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the Pilosa bitmap index server and how it can be used to run fast, continuous analytics on large and complex data sets","date_published":"2019-04-15T13:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/54b60985-b36f-4e92-b26f-21a97a325f7f.mp3","mime_type":"audio/mpeg","size_in_bytes":30200427,"duration_in_seconds":2621}]},{"id":"podlove-2019-04-07t18:47:31+00:00-0a8c181fc9fcbae","title":"Serverless Data Pipelines On DataCoral","url":"https://www.dataengineeringpodcast.com/datacoral-serverless-data-pipelines-episode-76","content_text":"Summary\nHow much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nManaging and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. 
Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Raghu Murthy about DataCoral, a platform that offers a fully managed and secure stack in your own cloud that delivers data to where you need it\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what DataCoral is and your motivation for founding it?\nHow does the data-centric approach of DataCoral differ from the way that other platforms think about processing information?\nCan you describe how the DataCoral platform is designed and implemented, and how it has evolved since you first began working on it?\n\nHow does the concept of a data slice play into the overall architecture of your platform?\nHow do you manage transformations of data schemas and formats as they traverse different slices in your platform?\n\n\nOn your site it mentions that you have the ability to automatically adjust to changes in external APIs, can you discuss how that manifests?\nWhat has been your experience, both positive and negative, in building on top of serverless components?\nCan you discuss the customer experience of onboarding onto Datacoral and how it differs between existing data platforms and greenfield projects?\nWhat are some of the slices that have proven to be the most challenging to implement?\n\nAre there any that you are currently building that you are most excited for?\n\n\nHow much effort do you anticipate if and/or when you begin to support other cloud providers?\nWhen is Datacoral the wrong choice?\nWhat do you have planned for the future of Datacoral, both from a technical and business perspective?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDatacoral\nYahoo!\nApache Hive\nRelational Algebra\nSocial Capital\nEIR == Entrepreneur In Residence\nSpark\nKafka\nAWS Lambda\nDAG == Directed Acyclic Graph\nAWS Redshift\nAWS Athena\nAWS Glue\nNoisy Neighbor Problem\nCI/CD\nSnowflakeDB\nDataBricks Delta\nAWS Sagemaker\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how DataCoral is building an abstraction layer over data pipelines using microservices built on serverless technologies","date_published":"2019-04-07T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b815e224-023a-4fe5-a4dc-bb8abd988f64.mp3","mime_type":"audio/mpeg","size_in_bytes":33466714,"duration_in_seconds":3221}]},{"id":"podlove-2019-04-01t01:30:23+00:00-2f3ec9f67f88fd6","title":"Why Analytics Projects Fail And What To Do About It","url":"https://www.dataengineeringpodcast.com/primetsr-data-analytics-episode-75","content_text":"Summary\nAnalytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure and not all of them are under our control. However, many of them are and as data engineers we can help to keep our projects on the path to success. Eugene Khazin is the CEO of PrimeTSR where he is tasked with rescuing floundering analytics efforts and ensuring that they provide value to the business. In this episode he reflects on the ways that data projects can be structured to provide a higher probability of success and utility, how data engineers can get throughout the project lifecycle, and how to salvage a failed project so that some value can be gained from the effort.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nManaging and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. 
Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Eugene Khazin about the leading causes for failure in analytics projects\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nThe term \"analytics\" has grown to mean many different things to different people, so can you start by sharing your definition of what is in scope for an \"analytics project\" for the purposes of this discussion?\n\nWhat are the criteria that you and your customers use to determine the success or failure of a project?\n\n\nI was recently speaking with someone who quoted a Gartner report stating an estimated failure rate of ~80% for analytics projects. Has your experience reflected this reality, and what have you found to be the leading causes of failure in your experience at PrimeTSR?\nAs data engineers, what strategies can we pursue to increase the success rate of the projects that we work on?\nWhat are the contributing factors that are beyond our control, which we can help identify and surface early in the lifecycle of a project?\nIn the event of a failed project, what are the lessons that we can learn and fold into our future work?\n\nHow can we salvage a project and derive some value from the efforts that we have put into it?\n\n\nWhat are some useful signals to identify when a project is on the road to failure, and steps that can be taken to rescue it?\nWhat advice do you have for data engineers to help them be more active and effective in the lifecycle of an analytics project?\n\nContact Info\n\nEmail\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPrime TSR\nDescriptive, Predictive, and Prescriptive Analytics\nAzure Data Factory\nAzure Data Warehouse\nMulesoft\nSSIS (SQL Server Integration Services)\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the common factors that contribute to failure in analytics projects and how data engineers can help keep them on the path to success","date_published":"2019-03-31T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4cec4ad6-8d22-42a6-9ab0-81067c2fb8f5.mp3","mime_type":"audio/mpeg","size_in_bytes":23385542,"duration_in_seconds":2190}]},{"id":"podlove-2019-03-25t13:02:14+00:00-f2299e37385b198","title":"Building An Enterprise Data Fabric At CluedIn","url":"https://www.dataengineeringpodcast.com/cluedin-data-fabric-episode-74","content_text":"Summary\nData integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grow. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to build a business aimed at building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nManaging and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.\nAlluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. 
Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your companies data fabric\n\nInterview\n\n\nIntroduction\n\n\nHow did you get involved in the area of data management?\n\n\nBefore we get started, can you share your definition of what a data fabric is?\n\n\nCan you explain what CluedIn is and share the story of how it started?\n\nCan you describe your ideal customer?\nWhat are some of the primary ways that organizations are using CluedIn?\n\n\n\nCan you give an overview of the system architecture that you have built and how it has evolved since you first began building it?\n\n\nFor a new customer of CluedIn, what is involved in the onboarding process?\n\n\nWhat are some of the most challenging aspects of data integration?\n\nWhat is your approach to managing the process of cleaning the data that you are ingesting?\n\nHow much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?\n\n\nHow do you preserve and expose data lineage/provenance to your customers?\n\n\n\nHow do you manage changes or breakage in the interfaces that you use for source or destination systems?\n\n\nWhat are some of the signals that you monitor to ensure the continued healthy operation of your platform?\n\n\nWhat are some of the most notable customer success stories that you have experienced?\n\nAre there any notable failures that you have experienced, and if so, what were the lessons learned?\n\n\n\nWhat are some cases where CluedIn is not the right choice?\n\n\nWhat do you have planned for the future of CluedIn?\n\n\nContact Info\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nCluedIn\nCopenhagen, Denmark\nA/B Testing\nData Fabric\nDataiku\nRapidMiner\nAzure Machine Learning Studio\nCRM (Customer Relationship Management)\nGraph Database\nData Lake\nGraphQL\nDGraph\n\nPodcast Episode\n\n\nRabbitMQ\nGDPR (General Data Protection Regulation)\nMaster Data Management\n\nPodcast Interview\n\n\nOAuth\nDocker\nKubernetes\nHelm\nDevOps\nDataOps\nDevOps vs DataOps Podcast Interview\nKafka\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about building an enterprise data fabric at scale to ease enterprise data integration","date_published":"2019-03-25T09:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/03e5ce78-dc53-4714-aa4f-60c65dc23c11.mp3","mime_type":"audio/mpeg","size_in_bytes":41574019,"duration_in_seconds":3469}]},{"id":"podlove-2019-03-18t10:12:55+00:00-3a8bca21dc59372","title":"A DataOps vs DevOps Cookoff In The Data Kitchen","url":"https://www.dataengineeringpodcast.com/dataops-vs-devops-episode-73","content_text":"Summary\nDelivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nManaging and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.\n\"There aren’t enough data conferences out there that focus on the community, so that’s why these folks built a better one\": Data Council is the premier community powered data platforms & engineering event for software engineers, data engineers, machine learning experts, deep learning researchers & artificial intelligence buffs who want to discover tools & insights to build new products. This year they will host over 50 speakers and 500 attendees (yeah that’s one of the best \"Attendee:Speaker\" ratios out there) in San Francisco on April 17-18th and are offering a $200 discount to listeners of the Data Engineering Podcast. Use code: DEP-200 at checkout\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Chris Bergh about the current state of DataOps and why it’s more than just DevOps for data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWe talked last year about what DataOps is, but can you give a quick overview of how the industry has changed or updated the definition since then?\n\nIt is easy to draw parallels between DataOps and DevOps, can you provide some clarity as to how they are different?\n\n\nHow has the conversation around DataOps influenced the design decisions of platforms and system components that are targeting the \"big data\" and data analytics ecosystem?\nOne of the commonalities is the desire to use collaboration as a means of reducing silos in a business. In the data management space, those silos are often in the form of distinct storage systems, whether application databases, corporate file shares, CRM systems, etc. What are some techniques that are rooted in the principles of DataOps that can help unify those data systems?\nAnother shared principle is in the desire to create feedback cycles. How do those feedback loops manifest in the lifecycle of an analytics project?\nTesting is critical to ensure the continued health and success of a data project. What are some of the current utilities that are available to data engineers for building and executing tests to cover the data lifecycle, from collection through to analysis and delivery?\nWhat are some of the components of a data analytics lifecycle that are resistant to agile or iterative development?\nWith the continued rise in the use of machine learning in production, how does that change the requirements for delivery and maintenance of an analytics platform?\nWhat are some of the trends that you are most excited for in the analytics and data platform space?\n\nContact Info\n\nData Kitchen\n\nEmail\n\n\nChris\n\nLinkedIn\n@ChrisBergh on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDownload the \"DataOps Cookbook\"\nData Kitchen\nPeace Corps\nMIT\nNASA\nMeyer’s Briggs Personality Test\nHBR (Harvard Business Review)\nMBA (Master of Business Administration)\nW. Edwards Deming\nDevOps\nLean Manufacturing\nTableau\nExcel\nAirflow\n\nPodcast.init Interview\n\n\nLooker\n\nPodcast Interview\n\n\nR Language\nAlteryx\nData Lake\nData Literacy\nData Governance\nDatadog\nKubernetes\nKubeflow\nMetis Machine\nGartner Hype Cycle\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the current state of DataOps and how it's not just DevOps for data","date_published":"2019-03-18T06:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c0745f04-7ce6-42a1-8974-79169d095e61.mp3","mime_type":"audio/mpeg","size_in_bytes":42126364,"duration_in_seconds":3271}]},{"id":"podlove-2019-03-04t17:17:36+00:00-b80ed5c8b933541","title":"Customer Analytics At Scale With Segment","url":"https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72","content_text":"Summary\nCustomer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nManaging and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.\nYour host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Segment is and how the business got started?\n\nWhat are some of the primary ways that your customers are using the Segment platform?\nHow have the capabilities and use cases of the Segment platform changed since it was first launched?\n\n\nLayered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the overall structure of Segment and the driving force behind their design and use?\nWhat are some of the best practices for structuring custom events in a way that they can be easily integrated with downstream platforms?\n\nHow do you manage changes or errors in the events generated by the various sources that you support?\n\n\nHow is the Segment platform architected and how has that architecture evolved over the past few years?\nWhat are some of the unique challenges that you face as a result of being a many-to-many event routing platform?\nIn addition to the various services that you integrate with for data delivery, you also support populating of data warehouses. What is involved in establishing and maintaining the schema and transformations for a customer?\nWhat have been some of the most interesting, unexpected, and/or challenging lessons that you have learned while building and growing the technical and business aspects of Segment?\nWhat are some of the features and improvements, both technical and business, that you have planned for the future?\n\nContact Info\n\nLinkedIn\n@calvinfo on Twitter\nWebsite\ncalvinfo on GitHub\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nSegment\nAWS\nClassMetric\nY Combinator\nAmplitude web and mobile analytics\nMixpanel\nKiss Metrics\nHacker News\nSegment Connections\nUser Analytics\nSalesForce\nRedshift\nBigQuery\nKinesis\nGoogle Cloud PubSub\nSegment Protocols data governance product\nSegment Personas\nHeap Analytics\n\nPodcast Episode\n\n\nHotel Tonight\nGolang\nKafka\nGDPR\nRocksDB\nDead Letter Queue\nSegment Centrifuge\nWebhook\nGoogle Analytics\nIntercom\nStripe\nGRPC\nDynamoDB\nFoundationDB\nParquet\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the platform Segment has built for routing streams of customer analytics data","date_published":"2019-03-04T12:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/07d376f2-3b3b-436d-8488-a3311dc4d143.mp3","mime_type":"audio/mpeg","size_in_bytes":34897399,"duration_in_seconds":2866}]},{"id":"podlove-2019-02-25t03:31:01+00:00-e6bacc4bdce9681","title":"Deep Learning For Data Engineers","url":"https://www.dataengineeringpodcast.com/deep-learning-data-engineers-episode-71","content_text":"Summary\nDeep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!\nManaging and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYou listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. 
Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off\nYour host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?\nWhat has been your personal experience with deep learning and what set you down that path?\nWhat is involved in building a data pipeline and production infrastructure for a deep learning product?\n\nHow does that differ from other types of analytics projects such as data warehousing or traditional ML?\n\n\nFor anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?\nWhat are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?\nWhat are some ways that we can use deep learning as part of the data management process?\n\nHow does that shift the infrastructure requirements for our platforms?\n\n\nCloud providers have been releasing numerous products to provide deep learning and/or GPUs as a managed platform. What are your thoughts on that layer of the build vs buy decision?\nWhat is your litmus test for whether to use deep learning vs explicit ML algorithms or a basic decision tree?\n\nDeep learning algorithms are often a black box in terms of how decisions are made, however regulations such as GDPR are introducing requirements to explain how a given decision gets made. How does that factor into determining what approach to take for a given project?\n\n\nFor anyone who wants to learn more about deep learning, what are some resources that you recommend?\n\nContact Info\n\nWebsite\nPluralsight\n@henson_tm on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nPluralsight\nDell EMC\nHadoop\nDBA (Database Administrator)\nElasticsearch\n\nPodcast Episode\n\n\nSpark\n\nPodcast Episode\n\n\nMapReduce\nDeep Learning\nMachine Learning\nNeural Networks\nFeature Engineering\nSVD (Singular Value Decomposition)\nAndrew Ng\n\nMachine Learning Course\n\n\nUnstructured Data Solutions Team of Dell EMC\nTensorflow\nPyTorch\nGPU (Graphics Processing Unit)\nNvidia RAPIDS\nProject Hydrogen\nSubmarine\nETL (Extract, Transform, Load)\nSupervised Learning\nUnsupervised Learning\nApache Kudu\n\nPodcast Episode\n\n\nCNN (Convolutional Neural Network)\nSentiment Analysis\nDataRobot\nGDPR\nWeapons Of Math Destruction by Cathy O’Neil\nBackpropagation\nDeep Learning Bootcamps\nThomas Henson Tensorflow Course on Pluralsight\nTFLearn\nGoogle ML Bootcamp\nCaffe deep learning framework\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about what data engineers need to know about deep learning","date_published":"2019-02-24T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a11a190a-cb3f-491d-b4b1-b23e035dfff7.mp3","mime_type":"audio/mpeg","size_in_bytes":26732380,"duration_in_seconds":2566}]},{"id":"podlove-2019-02-19t00:57:31+00:00-546738e7d4c1cff","title":"Speed Up Your Analytics With The Alluxio Distributed Storage System","url":"https://www.dataengineeringpodcast.com/alluxio-distributed-storage-episode-70","content_text":"Summary\nDistributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.\nIntroduction\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Bin Fan about Alluxio, a distributed virtual filesystem for unified access to disparate data sources\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Alluxio is and the history of the project?\n\nWhat are some of the use cases that Alluxio enables?\n\n\nHow is Alluxio implemented and how has its architecture evolved over time?\n\nWhat are some of the techniques that you use to mitigate the impact of latency, particularly when interfacing with storage systems across cloud providers and private data centers?\n\n\nWhen dealing with large volumes of data over time it is often necessary to age out older records to cheaper storage. 
What capabilities does Alluxio provide for that lifecycle management?\nWhat are some of the most complex or challenging aspects of providing a unified abstraction across disparate storage platforms?\n\nWhat are the tradeoffs that are made to provide a single API across systems with varying capabilities?\n\n\nTesting and verification of distributed systems is a complex undertaking. Can you describe the approach that you use to ensure proper functionality of Alluxio as part of the development and release process?\n\nIn order to allow for this large scale testing with any regularity it must be straightforward to deploy and configure Alluxio. What are some of the mechanisms that you have built into the platform to simplify the operational aspects?\n\n\nCan you describe a typical system topology that incorporates Alluxio?\nFor someone planning a deployment of Alluxio, what should they be considering in terms of system requirements and deployment topologies?\n\nWhat are some edge cases or operational complexities that they should be aware of?\n\n\nWhat are some cases where Alluxio is the wrong choice?\n\nWhat are some projects or products that provide a similar capability to Alluxio?\n\n\nWhat do you have planned for the future of the Alluxio project and company?\n\nContact Info\n\nLinkedIn\n@binfan on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nAlluxio\n\nProject\nCompany\n\n\nCarnegie Mellon University\nMemcached\nKey/Value Storage\nUC Berkeley AMPLab\nApache Spark\n\nPodcast Episode\n\n\nPresto\n\nPodcast Episode\n\n\nTensorflow\nHDFS\nLRU Cache\nHive Metastore\nIceberg Table Format\n\nPodcast Episode\n\n\nJava\nDependency Hell\nJava Class Loader\nApache Zookeeper\n\nPodcast Interview\n\n\nRaft Consensus Algorithm\nConsistent Hashing\nAlluxio Testing At Scale Blog Post\nS3Guard\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about the Alluxio distributed virtual in-memory file system","date_published":"2019-02-18T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b77057ef-d36c-4380-921f-313828cfc6df.mp3","mime_type":"audio/mpeg","size_in_bytes":39468806,"duration_in_seconds":3584}]},{"id":"podlove-2019-02-11t19:50:42+00:00-a33184ac3c9e56f","title":"Machine Learning In The Enterprise","url":"https://www.dataengineeringpodcast.com/prolego-ml-consulting-episode-69","content_text":"Summary\nMachine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.\nIntroduction\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nFor the benefit of software engineers and team leaders who are new to machine learning, can you briefly describe what machine learning is and why is it relevant to them?\nWhat is your primary mission at Prolego and how did you identify, execute on, and establish a presence in your particular market?\n\nHow much of your sales process is spent on educating your clients about what AI or ML are and the benefits that these technologies can provide?\n\n\nWhat have you found to be the technical skills and capacity necessary for being successful in building and deploying a machine learning project?\n\nWhen engaging with a client, what have you found to be the most common areas of technical capacity or knowledge that are needed?\n\n\nEveryone talks about a talent shortage in machine learning. 
Can you suggest a recruiting or skills development process for companies which need to build out their data engineering practice?\nWhat challenges will teams typically encounter when creating an efficient working relationship between data scientists and data engineers?\nCan you briefly describe a successful project of developing a first ML model and putting it into production?\n\nWhat is the breakdown of how much time was spent on different activities such as data wrangling, model development, and data engineering pipeline development?\nWhen releasing to production, can you share the types of metrics that you track to ensure the health and proper functioning of the models?\nWhat does a deployable artifact for a machine learning/deep learning application look like?\n\n\nWhat basic technology stack is necessary for putting the first ML models into production?\n\nHow does the build vs. buy debate break down in this space and what products do you typically recommend to your clients?\n\n\nWhat are the major risks associated with deploying ML models and how can a team mitigate them?\nSuppose a software engineer wants to break into ML. What data engineering skills would you suggest they learn? How should they position themselves for the right opportunity?\n\nContact Info\n\nEmail: Kevin Dewalt kevin@prolego.io and Russ Rands russ@prolego.io\nConnect on LinkedIn: Kevin Dewalt and Russ Rands\nTwitter: @kevindewalt\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nProlego\nDownload our book: Become an AI Company in 90 Days\nGoogle Rules Of ML\nAI Winter\nMachine Learning\nSupervised Learning\nO’Reilly Strata Conference\nGE Rebranding Commercials\nJez Humble: Stop Hiring Devops Experts (And Start Growing Them)\nSQL\nORM\nDjango\nRoR\nTensorflow\nPyTorch\nKeras\nData Engineering Podcast Episode About Data Teams\nDevOps For Data Teams – DevOps Days Boston Presentation by Tobias\nJupyter Notebook\nData Engineering Podcast: Notebooks at Netflix\nPandas\n\nPodcast Interview\n\n\nJoel Grus\n\nJupyterCon Presentation\nData Science From Scratch\n\n\nExpensify\nAirflow\n\nJames Meickle Interview\n\n\nGit\nJenkins\nContinuous Integration\nPractical Deep Learning For Coders Course by Jeremy Howard\nData Carpentry\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An interview about how to build, launch, and maintain machine learning products","date_published":"2019-02-11T14:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/481d2917-72a1-4148-aa3c-b8ec1b30260d.mp3","mime_type":"audio/mpeg","size_in_bytes":33558615,"duration_in_seconds":2898}]},{"id":"podlove-2019-02-04t01:29:44+00:00-ee17acd06cbb8a5","title":"Cleaning And Curating Open Data For Archaeology","url":"https://www.dataengineeringpodcast.com/open-context-open-data-platform-episode-68","content_text":"Summary\nArchaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.\nIntroduction\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data\n\nInterview\n\n\nIntroduction\n\n\nHow did you get involved in the area of data management?\nI did some database and GIS work for my dissertation in archaeology, back in the late 1990’s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies in research data management.\n\n\nCan you start by describing what Open Context is and how it started?\nOpen Context is an open access data publishing service for archaeology. It started because we need better ways of dissminating structured data and digital media than is possible with conventional articles, books and reports.\n\n\nWhat are your protocols for determining which data sets you will work with?\nDatasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.\n\n\nWhat are some of the challenges unique to research data?\n\n\nWhat are some of the unique requirements for processing, publishing, and archiving research data?\nYou have to work on a shoe-string budget, essentially providing \"public goods\". 
Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.\nAnother issues is that it will take a long time to publish enough data to power many \"meta-analyses\" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many paricularistic interests.\n\n\n\n\nHow much education is necessary around your content licensing for researchers who are interested in publishing their data with you?\nWe require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.\n\n\nCan you describe the system architecture that you use for Open Context?\nOpen Context is a Django Python application, with a Postgres database and an Apache Solr index. It’s running on Google cloud services on a Debian linux.\n\n\nWhat is the process for cleaning and formatting the data that you host?\n\n\nHow much domain expertise is necessary to ensure proper conversion of the source data?\nThat’s one of the bottle necks. We have to do an ETL (extract transform load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back and forth conversations with data creators.\n\n\nCan you discuss the challenges that you face in maintaining a consistent ontology?\n\n\nWhat pieces of metadata do you track for a given data set?\n\n\n\n\nCan you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?\n\nCan you walk through the lifecycle of a given data set?\n\n\n\nData archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?\n\n\nOnce the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?\n\n\nWhat are some of the most interesting uses you have seen of the data that is hosted on Open Context?\n\n\nWhat have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?\n\n\nWhat are your goals for the future of Open Context?\n\n\nContact Info\n\n@ekansa on Twitter\nLinkedIn\nResearchGate\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nOpen Context\nBronze Age\nGIS (Geographic Information System)\nFilemaker\nAccess Database\nExcel\nCreative Commons\nOpen Context On Github\nDjango\nPostgreSQL\nApache Solr\nGeoJSON\nJSON-LD\nRDF\nOCHRE\nSKOS (Simple Knowledge Organization System)\nDjango Reversion\nCalifornia Digital Library\nZenodo\nCERN\nDigital Index of North American Archaeology (DINAA)\nAnsible\nDocker\nOpenRefine\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An Interview About Building An Open Data Platform For Archaeologists","date_published":"2019-02-03T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ab250018-9696-4fde-b9b7-ae9ad87920fd.mp3","mime_type":"audio/mpeg","size_in_bytes":37460529,"duration_in_seconds":3655}]},{"id":"podlove-2019-01-29t03:36:25+00:00-89179b1adbef8d8","title":"Managing Database Access Control For Teams With strongDM","url":"https://www.dataengineeringpodcast.com/strongdm-database-access-control-episode-67","content_text":"Summary\nControlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads.\nIntroduction\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. 
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Justin McCarthy about StrongDM, a hosted service that simplifies access controls for your data\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining the problem that StrongDM is solving and how the company got started?\n\nWhat are some of the most common challenges around managing access and authentication for data storage systems?\nWhat are some of the most interesting workarounds that you have seen?\nWhich areas of authentication, authorization, and auditing are most commonly overlooked or misunderstood?\n\n\nCan you describe the architecture of your system?\n\nWhat strategies have you used to enable interfacing with such a wide variety of storage systems?\n\n\nWhat additional capabilities do you provide beyond what is natively available in the underlying systems?\nWhat are some of the most difficult aspects of managing varying levels of permission for different roles across the diversity of platforms that you support, given that they each have different capabilities natively?\nFor a customer who is onboarding, what is involved in setting up your platform to integrate with their systems?\nWhat are some of the assumptions that you made about your problem domain and market when you first started which have been disproven?\nHow do organizations in different industries react to your product and how do their policies around granting access to data differ?\nWhat are some of the most interesting/unexpected/challenging lessons that you have learned in the process of building and growing StrongDM?\n\nContact Info\n\nLinkedIn\n@justinm on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nStrongDM\nAuthentication Vs. Authorization\nHashicorp Vault\nConfiguration Management\nChef\nPuppet\nSaltStack\nAnsible\nOkta\nSSO (Single Sign On\nSOC 2\nTwo Factor Authentication\nSSH (Secure SHell)\nRDP\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

","summary":"An Interview About strongDM's Approach To Managing Access To Multiple Databases","date_published":"2019-01-28T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/de7100ab-96d9-4dcf-ba19-582f0f467eeb.mp3","mime_type":"audio/mpeg","size_in_bytes":30014637,"duration_in_seconds":2537}]},{"id":"podlove-2019-01-21t13:22:59+00:00-7fa46a139633ce6","title":"Building Enterprise Big Data Systems At LEGO","url":"https://www.dataengineeringpodcast.com/lego-enterprise-big-data-episode-66","content_text":"Summary\nBuilding internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nMy understanding is that the big data group at LEGO is a fairly recent development. 
Can you share the story of how it got started?\n\nWhat kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data?\nWhat was the transition process like, migrating data silos into a uniformly managed platform?\n\n\nWhat are the biggest data challenges that you face at LEGO?\nWhat are some of the most critical sources and types of data that you are managing?\nWhat are the main components of the data infrastructure that you have built to support the organizations analytical needs?\n\nWhat are some of the technologies that you have found to be most useful?\nWhich have been the most problematic?\n\n\nWhat does the team structure look like for the data services at LEGO?\n\nDoes that reflect in the types/numbers of systems that you support?\n\n\nWhat types of testing, monitoring, and metrics do you use to ensure the health of the systems you support?\nWhat have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO?\nHow have the data systems at Lego evolved over recent years as new technologies and techniques have been developed?\nHow does the global nature of the LEGO business influence the design strategies and technology choices for your platform?\nWhat are you most excited for in the coming year?\n\nContact Info\n\nJesper\n\nLinkedIn\n\n\nKeld\n\nLinkedIn\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nLEGO Group\nERP (Enterprise Resource Planning)\nPredictive Analytics\nPrescriptive Analytics\nHadoop\nCenter Of Excellence\nContinuous Integration\nSpark\n\nPodcast Episode\n\n\nApache NiFi\n\nPodcast Episode\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"

Summary

Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"An Interview With The Founding Members Of The LEGO Big Data Team","date_published":"2019-01-21T08:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/42bff9c3-2047-46fd-a558-b71dae7a729b.mp3","mime_type":"audio/mpeg","size_in_bytes":32965869,"duration_in_seconds":2883}]},{"id":"podlove-2019-01-14t00:48:44+00:00-8efed9f35824bf4","title":"TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65","url":"https://www.dataengineeringpodcast.com/timescaledb-round-2-episode-65","content_text":"Summary\n\nThe past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.\n\nIntroduction\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you refresh our memory about what TimescaleDB is?\nHow has the market for timeseries databases changed since we last spoke?\nWhat has changed in the focus and features of the TimescaleDB project and company?\nToward the end of 2018 you launched the 1.0 release of Timescale. 
What were your criteria for establishing that milestone?\n\nWhat were the most challenging aspects of reaching that goal?\n\n\n\nIn terms of timeseries workloads, what are some of the factors that differ across varying use cases?\n\n\nHow do those differences impact the ways in which Timescale is used by the end user, and built by your team?\n\n\n\nWhat are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?\nHow have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?\n\n\nHave you been able to leverage some of the native improvements to simplify your implementation?\nAre there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?\n\n\n\nWhat is in store for the future of the Timescale product and organization?\n\n\nContact Info\n\n\nAjay\n\n@acoustik on Twitter\nLinkedIn\n\n\n\nMike\n\n\nLinkedIn\nWebsite\n@michaelfreedman on Twitter\n\n\n\nTimescale\n\n\nWebsite\nDocumentation\nCareers\ntimescaledb on GitHub\n@timescaledb on Twitter\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTimescaleDB\nOriginal Appearance on the Data Engineering Podcast\n1.0 Release Blog Post\nPostgreSQL\n\nPodcast Interview\n\n\n\nRDS\nDB-Engines\nMongoDB\nIOT (Internet Of Things)\nAWS Timestream\nKafka\nPulsar\n\n\nPodcast Episode\n\n\n\nSpark\n\n\nPodcast Episode\n\n\n\nFlink\n\n\nPodcast Episode\n\n\n\nHadoop\nDevOps\nPipelineDB\n\n\nPodcast Interview\n\n\n\nGrafana\nTableau\nPrometheus\nOLTP (Online Transaction Processing)\nOracle DB\nData Lake\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.

Introduction

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Checking In On The Time Series Database Market With TimescaleDB (Interview)","date_published":"2019-01-13T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/499265b5-0538-4257-8b80-6a61729d2708.mp3","mime_type":"audio/mpeg","size_in_bytes":30150129,"duration_in_seconds":2485}]},{"id":"podlove-2019-01-05t03:38:04+00:00-56cffdb23cf7db1","title":"Performing Fast Data Analytics Using Apache Kudu - Episode 64","url":"https://www.dataengineeringpodcast.com/apache-kudu-with-brock-noland-and-jordan-birdsell-episode-64","content_text":"Summary\n\nThe Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into you analytics pipeline.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Kudu is and the motivation for building it?\n\nHow does it fit into the Hadoop ecosystem?\nHow does it compare to the work being done on the Iceberg table format?\n\n\n\nWhat are some of the common application and system design patterns that Kudu supports?\nHow is Kudu architected and how has it evolved over the life of the project?\nThere are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. 
What was the reasoning for using Raft in Kudu?\nHow does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?\n\n\nWhat are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?\n\n\n\nA number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?\nWhat was the motivation for using C++ as the language target for Kudu?\n\n\nIf you were to start the project over today what would you do differently?\n\n\n\nWhat are some situations where you would advise against using Kudu?\nWhat have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?\nWhat are you most excited about for the future of Kudu?\n\n\nContact Info\n\n\nBrock\n\nLinkedIn\n@brocknoland on Twitter\n\n\n\nJordan\n\n\nLinkedIn\n@jordanbirdsell\njbirdsell on GitHub\n\n\n\nPhData\n\n\nWebsite\nphdata on GitHub\n@phdatainc on Twitter\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nKudu\nPhData\nGetting Started with Apache Kudu\nThomson Reuters\nHadoop\nOracle Exadata\nSlowly Changing Dimensions\nHDFS\nS3\nAzure Blob Storage\nState Farm\nStanly Black & Decker\nETL (Extract, Transform, Load)\nParquet\n\nPodcast Episode\n\n\n\nORC\nHBase\nSpark\n\n\nPodcast Episode\n\n\n\nImpala\nNetflix Iceberg\n\n\nPodcast Episode\n\n\n\nHive ACID\nIOT (Internet Of Things)\nStreamsets\nNiFi\n\n\nPodcast Episode\n\n\n\nKafka Connect\nMoore’s Law\n3D XPoint\nRaft Consensus Algorithm\nSTONITH (Shoot The Other Node In The Head)\nYarn\nCython\n\n\nPodcast.__init__ Episode\n\n\n\nPandas\n\n\nPodcast.__init__ Episode\n\n\n\nCloudera Manager\nApache Sentry\nCollibra\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Bringing Fast Data To The Hadoop Ecosystem With Kudu (Interview)","date_published":"2019-01-06T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3cc0f209-42b2-4537-bd13-84f17db97869.mp3","mime_type":"audio/mpeg","size_in_bytes":34223144,"duration_in_seconds":3046}]},{"id":"podlove-2018-12-31t13:08:40+00:00-626851842c3eba3","title":"Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63","url":"https://www.dataengineeringpodcast.com/pravega-with-tom-kaitchuck-episode-63","content_text":"Summary\n\nAs more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fullfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuk explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the hosts mind is blown by the possibilities of treating everything, including schema information, as a stream.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Pravega is and the story behind it?\nWhat are the use cases for Pravega and how does it fit into the data ecosystem?\n\nHow does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?\n\n\n\nHow do you represent a stream on-disk?\n\n\nWhat are the benefits of using this format for persisted streams?\n\n\n\nOne of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?\nI am also intrigued by the automatic tiering of the persisted storage. 
How does that work and what options exist for managing the lifecycle of the data in the cluster?\nFor someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?\nWhat are some of the unique system design patterns that are made possible by Pravega?\nHow is Pravega architected internally?\nWhat is involved in integrating engines such as Spark, Flink, or Storm with Pravega?\nA common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?\n\n\nDoes it have any special capabilities for simplifying processing of out-of-order events?\n\n\n\nFor someone planning a deployment of Pravega, what is involved in building and scaling a cluster?\n\n\nWhat are some of the operational edge cases that users should be aware of?\n\n\n\nWhat are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?\nWhat are some cases where you would recommend against using Pravega?\nWhat is in store for the future of Pravega?\n\n\nContact Info\n\n\ntkaitchuk on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPravega\nAmazon SQS (Simple Queue Service)\nAmazon Simple Workflow Service (SWF)\nAzure\nEMC\nZookeeper\n\nPodcast Episode\n\n\n\nBookkeeper\nKafka\nPulsar\n\n\nPodcast Episode\n\n\n\nRocksDB\nFlink\n\n\nPodcast Episode\n\n\n\nSpark\n\n\nPodcast Episode\n\n\n\nHeron\nLambda Architecture\nKappa Architecture\nErasure Code\nFlink Forward Conference\nCAP Theorem\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host’s mind is blown by the possibilities of treating everything, including schema information, as a stream.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Stream-Native Storage For Unbounded Data With Pravega (Interview)","date_published":"2018-12-31T08:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/aa0efe25-0d55-4ab2-9308-356db6ec237c.mp3","mime_type":"audio/mpeg","size_in_bytes":30749877,"duration_in_seconds":2682}]},{"id":"podlove-2018-12-24t02:50:51+00:00-1acaf02e5e97af9","title":"Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62","url":"https://www.dataengineeringpodcast.com/pipelinedb-with-derek-nelson-and-usman-masood-episode-62","content_text":"Summary\n\nProcessing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what PipelineDB is and the motivation for creating it?\n\nWhat are the major use cases that it enables?\nWhat are some example applications that are uniquely well suited to the capabilities of PipelineDB?\n\n\n\nWhat are the major concepts and components that users of PipelineDB should be familiar with?\nGiven the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?\nWhat are some of the common patterns for populating data streams?\nWhat are the options for scaling PipelineDB systems, both vertically and horizontally?\n\n\nHow much elasticity does the system support in terms of changing volumes of inbound data?\nWhat are some of the limitations or edge cases that users should be aware of?\n\n\n\nGiven that inbound data is not persisted to disk, how do you guard against data loss?\n\n\nIs it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?\nCan a separate table be used as an input stream?\n\n\n\nSince the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?\nWhat are some of the features that you have found 
to be the most useful which users might initially overlook?\nWhat would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?\nWhat are some of the most challenging aspects of building continuous aggregates on unbounded data?\nWhat have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?\nWhat are some of the most interesting or unexpected ways that you have seen PipelineDB used?\nWhen is PipelineDB the wrong choice?\nWhat do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?\n\n\nContact Info\n\n\nDerek\n\nderekjn on GitHub\nLinkedIn\n\n\n\nUsman\n\n\n@usmanm on Twitter\nWebsite\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPipelineDB\nStride\nPostgreSQL\n\nPodcast Episode\n\n\n\nAdRoll\nProbabilistic Data Structures\nTimescaleDB\n\n\n[Podcast Episode](\n\n\n\nHive\nRedshift\nKafka\nKinesis\nZeroMQ\nNanomsg\nHyperLogLog\nBloom Filter\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Real-Time Analysis Of Time-Series Data In PostgreSQL With PipelineDB (Interview)","date_published":"2018-12-23T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/919c1062-8f5f-45ac-b697-cdf79fb78a16.mp3","mime_type":"audio/mpeg","size_in_bytes":41523140,"duration_in_seconds":3831}]},{"id":"podlove-2018-12-17t03:26:45+00:00-95d1d948f61e999","title":"Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61","url":"https://www.dataengineeringpodcast.com/advice-on-scaling-your-data-pipeline-alongside-your-business-with-christian-heinzmann-episode-61","content_text":"Summary\n\nEvery business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines are used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your definition of a data pipeline?\n\nAt what point in the life of a project or organization should you start thinking about building a pipeline?\n\n\n\nIn the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?\n\n\nWhat metrics/use cases should you be optimizing for at this point?\n\n\n\nWhat are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?\n\n\nHow do the design requirements for a data pipeline change as you reach this stage?\nWhat are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?\n\n\n\nWhat are some of the changes that are necessary as you move to a large scale data pipeline?\nAt each level of scale it is important to minimize the impact of the ETL process on the source systems. 
What are some strategies that you have employed to avoid degrading the performance of the application systems?\nIn recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?\nWhen performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?\nTransformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?\nWhat are your selection criteria when determining what workflow or ETL engines to use in your pipeline?\n\n\nHow has your preference of build vs buy changed at different scales of operation and as new/different projects become available?\n\n\n\nWhat are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?\nWhat are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?\nWhat are your plans for improving your current pipeline at Grubhub?\nWhat are some references that you recommend for anyone who is designing a new data platform?\n\n\nContact Info\n\n\n@sirchristian on Twitter\nBlog\nsirchristian on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nScaling ETL blog post\nGrubHub\nData Warehouse\nRedshift\nSpark\n\nSpark In Action Podcast Episode\n\n\n\nHive\nAmazon EMR\nLooker\n\n\nPodcast Episode\n\n\n\nRedash\nMetabase\n\n\nPodcast Episode\n\n\n\nA Primer on Enterprise Data Curation\nPub/Sub (Publish-Subscribe Pattern)\nChange Data Capture\nJenkins\nPython\nAzkaban\nLuigi\nZendesk\nData Lineage\nAirBnB Engineering Blog\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"The Evolution Of ETL As A Function Of Business Growth (Interview)","date_published":"2018-12-16T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/240c56c8-369c-40bc-8459-0421b29d4c35.mp3","mime_type":"audio/mpeg","size_in_bytes":30203381,"duration_in_seconds":2362}]},{"id":"podlove-2018-12-10t03:02:40+00:00-c9d4bf6d08fc6f5","title":"Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60","url":"https://www.dataengineeringpodcast.com/putting-apache-spark-into-action-with-jean-georges-perrin-episode-60","content_text":"Summary\n\nApache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. 
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Spark is?\n\nWhat are some of the main use cases for Spark?\nWhat are some of the problems that Spark is uniquely suited to address?\nWho uses Spark?\n\n\n\nWhat are the tools offered to Spark users?\nHow does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?\nFor someone building on top of Spark what are the main software design paradigms?\n\n\nHow does the design of an application change as you go from a local development environment to a production cluster?\n\n\n\nOnce your application is written, what is involved in deploying it to a production environment?\nWhat are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?\nWhat are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?\nWhat are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?\nWhat are the limitations of the Spark programming model?\n\n\nWhat are the cases where Spark is the wrong choice?\n\n\n\nWhat was your motivation for writing a book about Spark?\n\n\nWho is the target audience?\n\n\n\nWhat have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?\nWhat advice do you have for anyone who is considering or currently using Spark?\n\n\nContact Info\n\n\n@jgperrin on Twitter\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nBook Discount\n\n\nUse the code poddataeng18 to get 40% off of all of Manning’s products at manning.com\n\n\nLinks\n\n\nApache Spark\nSpark In Action\nBook code examples in GitHub\nInformix\nInternational Informix Users Group\nMySQL\nMicrosoft SQL Server\nETL (Extract, Transform, Load)\nSpark SQL and Spark In Action‘s chapter 11\nSpark ML and Spark In Action‘s chapter 18\nSpark Streaming (structured) and Spark In Action‘s chapter 10\nSpark GraphX\nHadoop\nJupyter\n\nPodcast Interview\n\n\n\nZeppelin\nDatabricks\nIBM Watson Studio\nKafka\nFlink\n\n\nPodcast Episode\n\n\n\nAWS Kinesis\nYarn\nHDFS\nHive\nScala\nPySpark\nDAG\nSpark Catalyst\nSpark Tungsten\nSpark UDF\nAWS EMR\nMesos\nDC/OS\nKubernetes\nDataframes\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Preamble

Interview

Contact Info

Parting Question

Book Discount

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Tackling Apache Spark From The Data Engineer's Perspective (Interview)","date_published":"2018-12-09T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/23a22282-ba13-48f2-b655-a919de993dbc.mp3","mime_type":"audio/mpeg","size_in_bytes":41056849,"duration_in_seconds":3031}]},{"id":"podlove-2018-12-03t03:03:53+00:00-94eaf4cf8c919f9","title":"Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59","url":"https://www.dataengineeringpodcast.com/apache-zookeeper-with-patrick-hunt-episode-59","content_text":"\nSummary\nDistributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather then re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. 
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Zookeeper is and how the project got started?\n\nWhat are the main motivations for using a centralized coordination service for distributed systems?\n\n\nWhat are the distributed systems primitives that are built into Zookeeper?\n\nWhat are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?\nWhat are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?\n\n\nCan you discuss how Zookeeper is architected and how that design has evolved over time?\n\nWhat have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?\n\n\nWhat are the scaling factors for Zookeeper?\n\nWhat are the edge cases that users should be aware of?\nWhere does it fall on the axes of the CAP theorem?\n\n\nWhat are the main failure modes for Zookeeper?\n\nHow much of the recovery logic is left up to the end user of the Zookeeper cluster?\n\n\nSince there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?\nIn recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?\n\nWhat are some of the cases where Zookeeper is the wrong choice?\n\n\nHow have the needs of distributed systems engineers changed since you first began working on Zookeeper?\nIf you were to start the project over today, what would you do differently?\n\nWould you still use Java?\n\n\nWhat are some of the most interesting or unexpected ways that you have seen Zookeeper used?\nWhat do you have planned for the future of Zookeeper?\n\nContact Info\n\n@phunt on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nZookeeper\nCloudera\nGoogle Chubby\nSourceforge\nHBase\nHigh Availability\nFallacies of distributed computing\nFalsehoods programmers believe about networking\nConsul\nEtcD\nApache Curator\nRaft Consensus Algorithm\nZookeeper Atomic Broadcast\nSSD Write Cliff\nApache Kafka\nApache Flink\n\nPodcast Episode\n\n\nHDFS\nKubernetes\nNetty\nProtocol Buffers\nAvro\nRust\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Building Distributed Systems On Top Of Apache Zookeeper (Interview)","date_published":"2018-12-02T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/47e2f00a-b793-4f06-9621-bf2ea7cd8ebd.mp3","mime_type":"audio/mpeg","size_in_bytes":33042368,"duration_in_seconds":3265}]},{"id":"podlove-2018-11-26t01:38:26+00:00-ea3317b973154a9","title":"Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58","url":"https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58","content_text":"Summary\n\nWhen your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data the folks at Dremio built a self service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Dremio is and how the project and business got started?\n\nWhat was the motivation for keeping your primary product open source?\nWhat is the governance model for the project?\n\n\n\nHow does Dremio fit in the current landscape of data tools?\n\n\nWhat are some use cases that Dremio is uniquely equipped to support?\nDo you think that Dremio obviates the need for a data warehouse or large scale data lake?\n\n\n\nHow is Dremio architected internally?\n\n\nHow has that architecture evolved from when it was first built?\n\n\n\nThere are a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?\n\n\nWhat are the benefits of integrating all of those capabilities into a single system?\nWhat are the drawbacks?\n\n\n\nOne of the useful features of Dremio is the granular access controls. 
Can you discuss how those are implemented and controlled?\nFor someone who is interested in deploying Dremio to their environment what is involved in getting it installed?\n\n\nWhat are the scaling factors?\n\n\n\nWhat are some of the most exciting features that have been added in recent releases?\nWhen is Dremio the wrong choice?\nWhat have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio?\nWhat do you have planned for the future of Dremio?\n\n\nContact Info\n\n\nTomer\n\n@tshiran on Twitter\nLinkedIn\n\n\n\nDremio\n\n\nWebsite\n@dremio on Twitter\ndremio on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDremio\nMapR\nPresto\nBusiness Intelligence\nArrow\nTableau\nPower BI\nJupyter\nOLAP Cube\nApache Foundation\nHadoop\nNikon DSLR\nSpark\nETL (Extract, Transform, Load)\nParquet\nAvro\nK8s\nHelm\nYarn\nGandiva Initiative for Apache Arrow\nLLVM\nTLS\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data the folks at Dremio built a self service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Building The Dremio Open Source Data-as-a-Service Platform (Interview)","date_published":"2018-11-25T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/025a776c-ba80-4854-a29d-d50e1a5df4f8.mp3","mime_type":"audio/mpeg","size_in_bytes":30125935,"duration_in_seconds":2358}]},{"id":"podlove-2018-11-19t00:09:52+00:00-61f515f3979b867","title":"Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57","url":"https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57","content_text":"Summary\n\nModern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Flink is and how the project got started?\nWhat are some of the primary ways that Flink is used?\nHow does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?\n\nWhat are some use cases that Flink is uniquely qualified to handle?\n\n\n\nWhere does Flink fit into the current data landscape?\nHow is Flink architected?\n\n\nHow has that architecture evolved?\nAre there any aspects of the current design that you would do differently if you started over today?\n\n\n\nHow does scaling work in a Flink deployment?\n\n\nWhat are the scaling limits?\nWhat are some of the failure modes that users should be aware of?\n\n\n\nHow is the statefulness of a cluster managed?\n\n\nWhat are the mechanisms for managing conflicts?\nWhat are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?\nCan state be shared across processes or tasks within a Flink cluster?\n\n\n\nWhat are the comparative challenges of working with bounded vs unbounded streams of data?\nHow do you handle out of order events in Flink, especially as the delay for a given event increases?\nFor 
someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?\nWhat are some of the most challenging or complicated aspects of building and maintaining Flink?\nWhat are some of the most interesting or unexpected ways that you have seen Flink used?\nWhat are some of the improvements or new features that are planned for the future of Flink?\nWhat are some features or use cases that you are explicitly not planning to support?\nFor people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?\n\n\nWhat do they find most interesting or exciting?\n\n\n\n\n\nContact Info\n\n\nLinkedIn\n@fhueske on Twitter\nfhueske on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nFlink\nData Artisans\nIBM\nDB2\nTechnische Universität Berlin\nHadoop\nRelational Database\nGoogle Cloud Dataflow\nSpark\nCascading\nJava\nRocksDB\nFlink Checkpoints\nFlink Savepoints\nKafka\nPulsar\nStorm\nScala\nLINQ (Language INtegrated Query)\nSQL\nBackpressure\nWatermarks\nHDFS\nS3\nAvro\nJSON\nHive Metastore\nDell EMC\nPravega\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Scalable and Stateful Streaming Data With Apache Flink (Interview)","date_published":"2018-11-18T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8529d7fa-4286-4292-ae22-d2b09cce42d5.mp3","mime_type":"audio/mpeg","size_in_bytes":39909768,"duration_in_seconds":2881}]},{"id":"podlove-2018-11-11t20:56:21+00:00-77d9d6217f217c6","title":"How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56","url":"https://www.dataengineeringpodcast.com/upsolver-with-yoni-iny-episode-56","content_text":"Summary\n\nA data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Upsolver is and how it got started?\n\nWhat are your goals for the platform?\n\n\n\nThere are a lot of opinions on both sides of the data lake argument. 
When is it the right choice for a data platform?\n\n\nWhat are the shortcomings of a data lake architecture?\n\n\n\nHow is Upsolver architected?\n\n\nHow has that architecture changed over time?\nHow do you manage schema validation for incoming data?\nWhat would you do differently if you were to start over today?\n\n\n\nWhat are the biggest challenges at each of the major stages of the data lake?\nWhat is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?\nWhen is Upsolver the wrong choice for an organization considering implementation of a data platform?\nIs there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?\nWhat features or improvements do you have planned for the future of Upsolver?\n\n\nContact Info\n\n\nYoni\n\nyoniiny on GitHub\nLinkedIn\n\n\n\nUpsolver\n\n\nWebsite\n@upsolver on Twitter\nLinkedIn\nFacebook\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nUpsolver\nData Lake\nIsraeli Army\nData Warehouse\nData Engineering Podcast Episode About Data Curation\nThree Vs\nKafka\nSpark\nPresto\nDrill\nSpot Instances\nObject Storage\nCassandra\nRedis\nLatency\nAvro\nParquet\nORC\nData Engineering Podcast Episode About Data Serialization Formats\nSSTables\nRun Length Encoding\nCSV (Comma Separated Values)\nProtocol Buffers\nKinesis\nETL\nDevOps\nPrometheus\nCloudwatch\nDataDog\nInfluxDB\nSQL\nPandas\nConfluent\nKSQL\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nA data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Building A Data Lake Platform In The Cloud At Upsolver (Interview)","date_published":"2018-11-11T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/74c7daab-0d26-4b70-a18e-f3b19f418ab1.mp3","mime_type":"audio/mpeg","size_in_bytes":29082977,"duration_in_seconds":3110}]},{"id":"podlove-2018-11-05t01:42:46+00:00-a2201e965b3e139","title":"Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55","url":"https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55","content_text":"Summary\n\nBusiness intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a a modern data platform that can serve the data needs of an entire company\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Looker is and the problem that it is aiming to solve?\n\nHow do you define business intelligence?\n\n\n\nHow is Looker unique from other approaches to business intelligence in the enterprise?\n\n\nHow does it compare to open source platforms for BI?\n\n\n\nCan you describe the technical infrastructure that supports Looker?\nGiven that you are connecting to the customer’s data store, how do you ensure sufficient security?\nFor someone who is using Looker, what does their workflow look like?\n\n\nHow does that change for different user roles (e.g. 
data engineer vs sales management)\n\n\n\nWhat are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?\nWhat are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?\n\n\nWhat are the portions of the Looker architecture that you would do differently if you were to start over today?\n\n\n\nWhat are some of the most interesting or unusual uses of Looker that you have seen?\nWhat is in store for the future of Looker?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nLooker\nUpworthy\nMoveOn.org\nLookML\nSQL\nBusiness Intelligence\nData Warehouse\nLinux\nHadoop\nBigQuery\nSnowflake\nRedshift\nDB2\nPostGres\nETL (Extract, Transform, Load)\nELT (Extract, Load, Transform)\nAirflow\nLuigi\nNiFi\nData Curation Episode\nPresto\nHive\nAthena\nDRY (Don’t Repeat Yourself)\nLooker Action Hub\nSalesforce\nMarketo\nTwilio\nNetscape Navigator\nDynamic Pricing\nSurvival Analysis\nDevOps\nBigQuery ML\nSnowflake Data Sharehouse\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nBusiness intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Easy And Powerful Self Service Business Intelligence With Looker (Interview)","date_published":"2018-11-04T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4f14f000-9269-46f2-b63d-fb7d95cbba36.mp3","mime_type":"audio/mpeg","size_in_bytes":34471531,"duration_in_seconds":3484}]},{"id":"podlove-2018-10-29t01:11:31+00:00-bc8bc0c289f7e69","title":"Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54","url":"https://www.dataengineeringpodcast.com/using-notebooks-as-the-unifying-layer-for-data-roles-at-netflix-with-matthew-seal-episode-54","content_text":"Summary\n\nJupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. 
Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?\n\nWhere are you using notebooks and where are you not?\n\n\n\nWhat is the technical infrastructure that you have built to suppport that design choice?\nWhich team was driving the effort?\n\n\nWas it difficult to get buy in across teams?\n\n\n\nHow much shared code have you been able to consolidate or reuse across teams/roles?\nHave you investigated the use of any of the other notebook platforms for similar workflows?\nWhat are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?\nWhat are some of the limitations of the notebook environment for the work that you are doing?\nWhat have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?\nWhat are some of the projects that are ongoing or planned for the future that you are most excited by?\n\n\nContact Info\n\n\nMatthew Seal\n\nEmail\nLinkedIn\n@codeseal on Twitter\nMSeal on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nNetflix Notebook Blog Posts\nNteract Tooling\nOpenGov\nProject Jupyter\nZeppelin Notebooks\nPapermill\nTitus\nCommuter\nScala\nPython\nR\nEmacs\nNBDime\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nJupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"How Netflix Is Using Jupyter Notebooks In Production (Interview)","date_published":"2018-10-28T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b1d91aff-fde9-413b-81b4-904e66268255.mp3","mime_type":"audio/mpeg","size_in_bytes":32126228,"duration_in_seconds":2454}]},{"id":"podlove-2018-10-22t01:49:12+00:00-7b6367ae353b3ac","title":"Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53","url":"https://www.dataengineeringpodcast.com/deon-with-emily-miller-and-peter-bull-episode-53","content_text":"Summary\n\nAs data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nThis is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com\n\n\nInterview\n\n\nIntroductions\nHow did you get introduced to Python?\nCan you start by describing what Deon is and your motivation for creating it?\nWhy a checklist, specifically? What’s the advantage of this over an oath, for example?\nWhat is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?\nWhat is the typical workflow for a team that is using Deon in their projects?\nDeon ships with a default checklist but allows for customization. 
What are some common addendums that you have seen?\n\nHave you received pushback on any of the default items?\n\n\n\nHow does Deon simplify communication around ethics across team boundaries?\nWhat are some of the most often overlooked items?\nWhat are some of the most difficult ethical concerns to comply with for a typical data science project?\nHow has Deon helped you at Driven Data?\nWhat are the customer facing impacts of embedding a discussion of ethics in the product development process?\nSome of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?\nWhat are your hopes for the future of the Deon project?\n\n\nKeep In Touch\n\n\nEmily\n\nLinkedIn\nejm714 on GitHub\n\n\n\nPeter\n\n\nLinkedIn\n@pjbull on Twitter\npjbull on GitHub\n\n\n\nDriven Data\n\n\n@drivendataorg on Twitter\ndrivendataorg on GitHub\nWebsite\n\n\n\n\n\nPicks\n\n\nTobias\n\nRichard Bond Glass Art\n\n\n\nEmily\n\n\nTandem Coffee in Portland, Maine\n\n\n\nPeter\n\n\nThe Model Bakery in Saint Helena and Napa, California\n\n\n\n\n\nLinks\n\n\nDeon\nDriven Data\nInternational Development\nBrookings Institution\nStata\nEconometrics\nMetis Bootcamp\nPandas\n\nPodcast Episode\n\n\n\nC#\n.NET\nPodcast.__init__ Episode On Software Ethics\nJupyter Notebook\n\n\nPodcast Episode\n\n\n\nWord2Vec\ncookiecutter data science\nLogistic Regression\n\n\nThe intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nAs data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.\n\nPreamble\n\nInterview\n\nKeep In Touch\n\nPicks\n\nLinks\n\nThe intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
","summary":"Of Checklists, Ethics, and Data (Interview)","date_published":"2018-10-21T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a8e4484b-9ae0-4e70-80f3-2951bb7a72e3.mp3","mime_type":"audio/mpeg","size_in_bytes":32506206,"duration_in_seconds":2732}]},{"id":"podlove-2018-10-14t23:24:07+00:00-ff1cb5a6698d52b","title":"Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52","url":"https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52","content_text":"Summary\n\nWith the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Iceberg is and the motivation for creating it?\n\nWas the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?\n\n\n\nHow has the use of Iceberg simplified your work at Netflix?\nHow is the reference implementation architected and how has it evolved since you first began work on it?\n\n\nWhat is involved in deploying it to a user’s environment?\n\n\n\nFor someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?\n\n\nIs there a migration path for pre-existing tables into the Iceberg format?\n\n\n\nHow is schema evolution managed at the file level?\n\n\nHow do you handle files on disk that don’t contain all of the fields specified in a table definition?\n\n\n\nOne of the complicated problems in data modeling is managing table partitions. 
How does Iceberg help in that regard?\nWhat are the unique challenges posed by using S3 as the basis for a data lake?\n\n\nWhat are the benefits that outweigh the difficulties?\n\n\n\nWhat have been some of the most challenging or contentious details of the specification to define?\n\n\nWhat are some things that you have explicitly left out of the specification?\n\n\n\nWhat are your long-term goals for the Iceberg specification?\n\n\nDo you anticipate the reference implementation continuing to be used and maintained?\n\n\n\n\n\nContact Info\n\n\nrdblue on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nIceberg Reference Implementation\nIceberg Table Specification\nNetflix\nHadoop\nCloudera\nAvro\nParquet\nSpark\nS3\nHDFS\nHive\nORC\nS3mper\nGit\nMetacat\nPresto\nPig\nDDL (Data Definition Language)\nCost-Based Optimization\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nWith the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Iceberg: Improving The Utility Of Cloud-Native Big Data At Netflix (Interview)","date_published":"2018-10-14T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/72d14cd1-c100-48a6-a787-0f9826c34b5c.mp3","mime_type":"audio/mpeg","size_in_bytes":40284365,"duration_in_seconds":3225}]},{"id":"podlove-2018-10-09t12:06:14+00:00-f4ac3d0a394116c","title":"Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51","url":"https://www.dataengineeringpodcast.com/memsql-with-nikita-shamgunov-episode-51","content_text":"Summary\n\nOne of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nAnd the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. 
Go to metismachine.com/webinars now to register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what MemSQL is and how the product and business first got started?\nWhat are the typical use cases for customers running MemSQL?\nWhat are the benefits of integrating the ingestion pipeline with the database engine?\n\nWhat are some typical ways that the ingest capability is leveraged by customers?\n\n\n\nHow is MemSQL architected and how has the internal design evolved from when you first started working on it?\n\n\nWhere does it fall on the axes of the CAP theorem?\n\n\nHow much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?\n\nCan you describe the lifecycle of a write transaction?\n\n\n\n\n\nCan you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?\n\n\nHow do you mitigate the impact of network latency throughout the cluster during query planning and execution?\n\n\n\nHow much of the implementation of MemSQL is using custom built code vs. open source projects?\n\nWhat are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?\nWhat have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?\nWhen is MemSQL the wrong choice for a data platform?\nWhat do you have planned for the future of MemSQL?\n\n\nContact Info\n\n\n@nikitashamgunov on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nMemSQL\nNewSQL\nMicrosoft SQL Server\nSt. Petersburg University of Fine Mechanics And Optics\nC\nC++\nIn-Memory Database\nRAM (Random Access Memory)\nFlash Storage\nOracle DB\nPostgreSQL\n\nPodcast Episode\n\n\n\nKafka\nKinesis\nWealth Management\nData Warehouse\nODBC\nS3\nHDFS\nAvro\nParquet\nData Serialization Podcast Episode\nBroadcast Join\nShuffle Join\nCAP Theorem\nApache Arrow\nLZ4\nS2 Geospatial Library\nSybase\nSAP Hana\nKubernetes\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nOne of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Fast, Scalable, and Flexible Data For Applications And Analytics On MemSQL (Interview)","date_published":"2018-10-09T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d1b62f72-a74f-4d32-9274-0aa2f3f84e67.mp3","mime_type":"audio/mpeg","size_in_bytes":44516583,"duration_in_seconds":3414}]},{"id":"podlove-2018-09-30t23:41:18+00:00-e77db96e522ee0c","title":"Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50","url":"https://www.dataengineeringpodcast.com/enigma-with-chris-groskopf-episode-50","content_text":"Summary\n\nThere are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how the are using public data sources to build a knowledge graph\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you give a brief overview of what Enigma has built and what the motivation was for starting the company?\n\nHow do you define the concept of a knowledge graph?\n\n\n\nWhat are the processes involved in constructing a knowledge graph?\nCan you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?\nWhat are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?\n\n\nHow do you manage the software lifecycle for your ETL code?\nWhat kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?\n\n\n\nWhat are the current challenges that you are facing in building and scaling your data infrastructure?\n\n\nHow does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?\nWhat techniques are you using to manage accuracy and consistency in the data that you ingest?\n\n\n\nCan you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?\nWhat are the weak spots in your platform that you are planning to address in upcoming projects?\n\n\nIf you were to start from scratch today, what would you have done differently?\n\n\n\nWhat are some of the most interesting or unexpected uses of your product that you have seen?\nWhat is in store for the future of Enigma?\n\n\nContact Info\n\n\nEmail\nTwitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nEnigma\nChicago Tribune\nNPR\nQuartz\nCSVKit\nAgate\nKnowledge Graph\nTaxonomy\nConcourse\nAirflow\nDocker\nS3\nData Lake\nParquet\n\nPodcast Episode\n\n\n\nSpark\nAWS Neptune\nAWS Batch\nMoney Laundering\nJupyter Notebook\nPapermill\nJupytext\nCauldron: The Un-Notebook\n\n\nPodcast.__init__ Episode\n\n\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nThere are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"The Data Engineering Behind A Real-World Knowledge Graph (Interview)","date_published":"2018-09-30T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ed5c6ea6-36db-4792-8e31-a95276b6ae01.mp3","mime_type":"audio/mpeg","size_in_bytes":36478263,"duration_in_seconds":3172}]},{"id":"podlove-2018-09-24t02:17:29+00:00-090c9d962100c0f","title":"A Primer On Enterprise Data Curation with Todd Walter - Episode 49","url":"https://www.dataengineeringpodcast.com/a-primer-on-enterprise-data-curation-with-todd-walter-episode-49","content_text":"Summary\nAs your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nHow do you define data curation?\n\nWhat are some of the high level concerns that are encapsulated in that effort?\n\n\nHow does the size and maturity of a company affect the ways that they architect and interact with their data systems?\nCan you walk through the stages of an ideal lifecycle for data within the context of an organizations uses for it?\nWhat are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?\nWhat has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?\nAs “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?\nIn terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?\n\nWhat is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?\n\n\nOnce an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?\nETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?\nWhat are some of the areas of data architecture and curation that are most often forgotten or ignored?\nWhat resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTeradata\nData Architecture\nData Curation\nData Warehouse\nChief Data Officer\nETL (Extract, Transform, Load)\nData Lake\nMetadata\nData Lineage\n\nData Provenance\n\n\nStrata Conference\nELT (Extract, Load, Transform)\nMap-Reduce\nHive\nPig\nSpark\nData Governance\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n","content_html":"

Summary\n\nAs your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Big Data Curation Strategies (Interview)","date_published":"2018-09-23T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/dfabf935-e94b-4d06-880d-7c3457c34534.mp3","mime_type":"audio/mpeg","size_in_bytes":33392970,"duration_in_seconds":2975}]},{"id":"podlove-2018-09-16t22:55:31+00:00-11b782cbf0d3955","title":"Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48","url":"https://www.dataengineeringpodcast.com/snowplow-with-alexander-dean-episode-48","content_text":"Summary\n\nEvery business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nThis is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics\n\n\nInterview\n\n\nIntroductions\nHow did you get involved in the area of data engineering and data management?\nWhat is Snowplow Analytics and what problem were you trying to solve when you started the company?\nWhat is unique about customer event data from an ingestion and processing perspective?\nChallenges with properly matching up data between sources\nData collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. 
How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?\n\nCleanliness/accuracy\n\n\n\nWhat kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?\nCan you describe the overall architecture of the ingest pipeline that Snowplow provides?\n\n\nHow has that architecture evolved from when you first started?\nWhat would you do differently if you were to start over today?\n\n\n\nEnsuring appropriate use of enrichment sources\nWhat have been some of the biggest challenges encountered while building and evolving Snowplow?\nWhat are some of the most interesting uses of your platform that you are aware of?\n\n\nKeep In Touch\n\n\nAlex\n\n@alexcrdean on Twitter\nLinkedIn\n\n\n\nSnowplow\n\n\n@snowplowdata on Twitter\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nSnowplow\n\nGitHub\n\n\n\nDeloitte Consulting\nOpenX\nHadoop\nAWS\nEMR (Elastic Map-Reduce)\nBusiness Intelligence\nData Warehousing\nGoogle Analytics\nCRM (Customer Relationship Management)\nS3\nGDPR (General Data Protection Regulation)\nKinesis\nKafka\nGoogle Cloud Pub-Sub\nJSON-Schema\nIglu\nIAB Bots And Spiders List\nHeap Analytics\n\n\nPodcast Interview\n\n\n\nRedshift\nSnowflakeDB\nSnowplow Insights\nGoogle Cloud Platform\nAzure\nGitLab\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nEvery business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.\n\nPreamble\n\nInterview\n\nKeep In Touch\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Taking Ownership Of Your Web Analytics With Snowplow (Interview)","date_published":"2018-09-16T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/18448c4b-3afb-4f2a-a84d-03bcfaf9546b.mp3","mime_type":"audio/mpeg","size_in_bytes":31591472,"duration_in_seconds":2868}]},{"id":"podlove-2018-09-10t01:18:37+00:00-ff982db60725263","title":"Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47","url":"https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47","content_text":"Summary\n\nElasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for our serverless data analysis.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $/0 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?\n\nWhat types of data are you focused on supporting?\nWhat are the challenges inherent to scaling an elasticsearch infrastructure to large volumes of log or metric data?\n\n\n\nIs there any need for an Elasticsearch cluster in addition to Chaos Search?\nFor someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?\nWhat are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?\nGiven that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?\nWhat mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?\nWhat is the system architecture that you have built to allow for querying terabytes of data in S3?\n\n\nWhat are the biggest contributors to query latency and what have you done to mitigate them?\n\n\n\nWhat are the options for access control when running queries against the data stored in S3?\nWhat are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?\nWhat are your plans for the future of Chaos Search?\n\n\nContact Info\n\n\nPete Cheslock\n\n@petecheslock on Twitter\nWebsite\n\n\n\nThomas Hazel\n\n\n@thomashazel on Twitter\nLinkedIn\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nChaos Search\nAWS S3\nCassandra\nElasticsearch\n\nPodcast Interview\n\n\n\nPostgreSQL\nDistributed Systems\nInformation Theory\nLucene\nInverted Index\nKibana\nLogstash\nNVMe\nAWS KMS\nKinesis\nFluentD\nParquet\nAthena\nPresto\nDrill\nBackblaze\nOpenStack Swift\nMinio\nEMR\nDataDog\nNewRelic\nElastic Beats\nMetricbeat\nGraphite\nSnappy\nScala\nAkka\nElastalert\nTensorflow\nX-Pack\nData Lake\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary\n\nElasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.\n\nPreamble\n\nInterview\n\nContact Info\n\nParting Question\n\nLinks\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Using Chaos Search To Make Long Term Log Storage Affordable And Useful (Interview)","date_published":"2018-09-09T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fe0b97ef-ca51-4449-9380-bd69d4e10feb.mp3","mime_type":"audio/mpeg","size_in_bytes":30805138,"duration_in_seconds":2888}]},{"id":"podlove-2018-09-03t16:44:39+00:00-47c593fa2092739","title":"An Agile Approach To Master Data Management with Mark Marinelli - Episode 46","url":"https://www.dataengineeringpodcast.com/an-agile-approach-to-master-data-management-with-mark-marinelli-episode-46","content_text":"Summary\n\nWith the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by establishing a definition of data mastering that we can work from?\n\nHow does the master data set get used within the overall analytical and processing systems of an organization?\n\n\n\nWhat is the traditional workflow for creating a master data set?\n\n\nWhat has changed in the current landscape of businesses and technology platforms that makes that approach impractical?\nWhat are the steps that an organization can take to evolve toward an agile approach to data mastering?\n\n\n\nAt what scale of company or project does it makes sense to start building a master data set?\nWhat are the limitations of using ML/AI to merge data sets?\nWhat are the limitations of a golden master data set in practice?\n\n\nAre there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?\nAre there specific problem domains that are more likely to benefit from a master data set?\n\n\n\nOnce a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data) \nWhat storage mechanisms are typically used for managing a master data set?\n\n\nAre there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that goes beyond the rest of their data infrastructure?\nHow do you manage latency issues when trying to reference the same entities from multiple disparate systems?\n\n\n\nWhat have you found to be the most common stumbling blocks for a group that is implementing a master data platform?\n\n\nWhat suggestions do you have to help prevent such a project from being derailed?\n\n\n\nWhat resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTamr\nMulti-Dimensional Database\nMaster Data Management\nETL\nEDW (Enterprise Data Warehouse)\nWaterfall Development Method\nAgile Development Method\nDataOps\nFeature Engineering\nTableau\nQlik\nData Catalog\nPowerBI\nRDBMS (Relational Database Management System)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


With the proliferation of data sources to give a more comprehensive view of the information critical to your business, it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
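
To make the matching problem concrete, here is a minimal sketch of rule-based entity resolution using only the Python standard library; the record fields and the 0.7 similarity threshold are illustrative assumptions, and production master data platforms such as Tamr layer machine learning and human review on top of this kind of matching.

from difflib import SequenceMatcher

def normalize(name):
    # Lowercase and strip punctuation so "Bob Smith." and "bob smith" compare equal.
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def similarity(a, b):
    # Ratio in [0, 1]; higher means the normalized names share more characters in order.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Hypothetical records from two systems that may describe the same person.
erp_customers = [{"id": 342, "name": "Robert Smith", "email": "bob@example.com"}]
twitter_accounts = [{"handle": "@bobsmith", "name": "Bob Smith."}]

for erp in erp_customers:
    for tw in twitter_accounts:
        if similarity(erp["name"], tw["name"]) > 0.7:  # threshold is a tunable guess
            print("Probable match: ERP #%s <-> %s" % (erp["id"], tw["handle"]))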

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Building A Master Data Catalog Using Machine Learning (Interview)","date_published":"2018-09-03T14:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4d29958c-7881-42fa-8c71-2b1e5dccb6f9.mp3","mime_type":"audio/mpeg","size_in_bytes":34971487,"duration_in_seconds":2836}]},{"id":"podlove-2018-08-27t00:16:49+00:00-a46e7e01c804851","title":"Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45","url":"https://www.dataengineeringpodcast.com/enveil-with-ellison-anne-williams-episode-45","content_text":"Summary\n\nThere are myriad reasons why data should be protected, and just as many ways to enforce it in tranist or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anny Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data security?\nCan you start by explaining what your mission is with Enveil and how the company got started?\nOne of the core aspects of your platform is the principal of homomorphic encryption. Can you explain what that is and how you are using it?\n\nWhat are some of the challenges associated with scaling homomorphic encryption?\nWhat are some difficulties associated with working on encrypted data sets?\n\n\n\nCan you describe the underlying architecture for your data platform?\n\n\nHow has that architecture evolved from when you first began building it?\n\n\n\nWhat are some use cases that are unlocked by having a fully encrypted data platform?\nFor someone using the Enveil platform, what does their workflow look like?\nA major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?\nWhat are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. 
identifying individuals based on geographic data, or purchase patterns)\nWhat do you have planned for the future of Enveil?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data security today?\n\n\nLinks\n\n\nEnveil\nNSA\nGDPR\nIntellectual Property\nZero Trust\nHomomorphic Encryption\nCiphertext\nHadoop\nPII (Personally Identifiable Information)\nTLS (Transport Layer Security)\nSpark\nElasticsearch\nSide-channel attacks\nSpectre and Meltdown\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


There are myriad reasons why data should be protected, and just as many ways to enforce that protection in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
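
Enveil's own protocols are proprietary, but the idea of computing on data without decrypting it can be sketched with a textbook additively homomorphic scheme. The toy Paillier implementation below uses tiny hard-coded primes purely for illustration and is not secure; it shows two ciphertexts being combined so that the decrypted result is the sum of the plaintexts.

import math
import random

# Toy Paillier keypair with tiny hard-coded primes; illustration only, never use in practice.
p, q = 61, 53
n = p * q
n_sq = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lambda = lcm(p-1, q-1)
mu = pow(lam, -1, n)                               # with g = n + 1, mu = lambda^-1 mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    x = pow(c, lam, n_sq)
    return ((x - 1) // n) * mu % n

a, b = encrypt(20), encrypt(22)
# Multiplying ciphertexts adds the plaintexts: the sum is computed without decrypting the inputs.
print("sum recovered from ciphertext:", decrypt((a * b) % n_sq))  # -> 42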

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Using Homomorphic Encryption In Production With Enveil (Interview)","date_published":"2018-08-27T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f1783c50-fe4b-4eea-957e-a6efd4645836.mp3","mime_type":"audio/mpeg","size_in_bytes":26442491,"duration_in_seconds":1481}]},{"id":"podlove-2018-08-20t03:15:13+00:00-95881c7e7bccea6","title":"Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44","url":"https://www.dataengineeringpodcast.com/dgraph-with-manish-jain-episode-44","content_text":"Summary\n\nThe way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nIf you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metatdata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.\nPython has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience\nas a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. 
Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is DGraph and what motivated you to build it?\nGraph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?\n\nThe graph space is becoming crowded in recent years. How does DGraph compare to the current set of offerings?\n\n\n\nWhat are some of the common uses of graph storage systems?\n\n\nWhat are some potential uses that are often overlooked?\n\n\n\nThere are a few ways that graph structures and properties can be implemented, including the ability to store data in the vertices connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?\nHow does the query interface and data storage in DGraph differ from other options?\n\n\nWhat are your opinions on the graph query languages that have been adopted by other storages systems, such as Gremlin, Cypher, and GSQL?\n\n\n\nHow is DGraph architected and how has that architecture evolved from when it first started?\nHow do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?\nIn your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?\nWhat are the limitations of DGraph in terms of scalability or usability?\nWhere does it fall along the axes of the CAP theorem?\nFor someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?\nWhat have been the most challenging aspects of building and growing the DGraph project and community?\nWhat are some of the most interesting or unexpected uses of DGraph that you are aware of?\nWhen is DGraph the wrong choice?\nWhat are your plans for the future of DGraph?\n\n\nContact Info\n\n\n@manishrjain on Twitter\nmanishrjain on GitHub\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDGraph\nBadger\nGoogle Knowledge Graph\nGraph Theory\nGraph Database\nSQL\nRelational Database\nNoSQL\nOLTP (On-Line Transaction Processing)\nNeo4J\nPostgreSQL\nMySQL\nBigTable\nRecommendation System\nFraud Detection\nCustomer 360\nUsenet Express\nIPFS\nGremlin\nCypher\nGSQL\nGraphQL\nMetaWeb\nRAFT\nSpanner\nHBase\nElasticsearch\nKubernetes\nTLS (Transport Layer Security)\nJepsen Tests\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph; however, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.
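
As a rough illustration of why graph-shaped workloads benefit from graph-native storage, the sketch below runs a two-hop traversal over an in-memory adjacency list; it is a generic example in plain Python, not DGraph's query language or API.

from collections import deque

# A tiny property graph as adjacency lists; a hypothetical stand-in for what a
# graph database stores natively.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def within_hops(start, max_hops):
    # Breadth-first traversal: the kind of k-hop query a graph database answers
    # directly, without the chained self-joins a relational schema would need.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in follows.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(within_hops("alice", 2))  # {'bob', 'carol', 'dave', 'erin'}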

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"DGraph: A Fast, Distributed, Transactional Graph Database Built For Scale (Interview)","date_published":"2018-08-19T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/73c2b2c4-e0c4-42df-bc6d-8b40ee71b02a.mp3","mime_type":"audio/mpeg","size_in_bytes":29522736,"duration_in_seconds":2559}]},{"id":"podlove-2018-08-12t22:06:00+00:00-ed1aaaac65a74f3","title":"Putting Airflow Into Production With James Meickle - Episode 43","url":"https://www.dataengineeringpodcast.com/airflow-in-production-with-james-meickle-episode-43","content_text":"Summary\n\nThe theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was your initial project requirement?\n\nWhat tooling did you consider in addition to Airflow?\nWhat aspects of the Airflow platform led you to choose it as your implementation target?\n\n\n\nCan you describe your current deployment architecture?\n\n\nHow many engineers are involved in writing tasks for your Airflow installation?\n\n\n\nWhat resources were the most helpful while learning about Airflow design patterns?\n\n\nHow have you architected your DAGs for deployment and extensibility?\n\n\n\nWhat kinds of tests and automation have you put in place to support the ongoing stability of your deployment?\nWhat are some of the dead-ends or other pitfalls that you encountered during the course of this project?\nWhat aspects of Airflow have you found to be lacking that you would like to see improved?\nWhat did you wish someone had told you before you started work on your Airflow installation?\n\n\nIf you were to start over would you make the same choice?\nIf Airflow wasn’t available what would be your second choice?\n\n\n\nWhat are your next steps for improvements and fixes?\n\n\nContact Info\n\n\n@eronarn on Twitter\nWebsite\neronarn on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nQuantopian\nHarvard Brain 
Science Initiative\nDevOps Days Boston\nGoogle Maps API\nCron\nETL (Extract, Transform, Load)\nAzkaban\nLuigi\nAWS Glue\nAirflow\nPachyderm\n\nPodcast Interview\n\n\n\nAirBnB\nPython\nYAML\nAnsible\nREST (Representational State Transfer)\nSAML (Security Assertion Markup Language)\nRBAC (Role-Based Access Control)\nMaxime Beauchemin\n\n\nMedium Blog\n\n\n\nCelery\nDask\n\n\nPodcast Interview\n\n\n\nPostgreSQL\n\n\nPodcast Interview\n\n\n\nRedis\nCloudformation\nJupyter Notebook\nQubole\nAstronomer\n\n\nPodcast Interview\n\n\n\nGunicorn\nKubernetes\nAirflow Improvement Proposals\nPython Enhancement Proposals (PEP)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
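
For context on what an Airflow deployment actually runs, here is a minimal DAG written against the Airflow 1.x-era API that was current around the time of this episode; the task names and schedule are made up.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull data from the upstream source")

def load():
    print("write results to the warehouse")

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

# A two-task DAG: extract runs first, then load, once per day.
dag = DAG(
    dag_id="example_pipeline",
    default_args=default_args,
    start_date=datetime(2018, 8, 1),
    schedule_interval="@daily",
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # declare the dependency between the two tasks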

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Lessons Learned While Building A Data Science Platform With Airflow (Interview)","date_published":"2018-08-12T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5bc29d0b-9359-4d3a-830e-c33b5d368978.mp3","mime_type":"audio/mpeg","size_in_bytes":42226599,"duration_in_seconds":2885}]},{"id":"podlove-2018-08-06t05:28:43+00:00-074e03958020c5e","title":"Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42","url":"https://www.dataengineeringpodcast.com/postgresql-with-jonathan-katz-episode-42","content_text":"Summary\n\nOne of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nHow did you get involved in the Postgres project?\nFor anyone who hasn’t used it, can you describe what PostgreSQL is?\n\nWhere did Postgres get started and how has it evolved over the intervening years?\n\n\n\nWhat are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?\n\n\nWhat are some cases where Postgres is the wrong choice?\n\n\n\nWhat are some of the common points of confusion for new users of PostGreSQL? (particularly if they have prior database experience)\nThe recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?\nWhat are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?\nAre there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?\nWhat is in store for the future of Postgres?\n\n\nContact Info\n\n\n@jkatz05 on Twitter\njkatz on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPostgreSQL\nCrunchy Data\nVenuebook\nPaperless Post\nLAMP Stack\nMySQL\nPHP\nSQL\nORDBMS\nEdgar Codd\nA Relational Model of Data for Large Shared Data Banks\nRelational Algebra\nOracle DB\nUC Berkeley\nDr. Michael Stonebraker\nIngres\nInformix\nQUEL\nANSI C\nCVS\nBSD License\nUUID\nJSON\nXML\nHStore\nPostGIS\nBTree Index\nGIN Index\nGIST Index\nKNN GIST\nSPGIST\nFull Text Search\nBRIN Index\nWAL (Write-Ahead Log)\nSQLite\nPGAdmin\nVim\nEmacs\nLinux\nOLAP (Online Analytical Processing)\nPostgres IRC\nPostgres Slack\nPostgres Conferences\nUPSERT\nPostgres Roadmap\nCockroachDB\n\nPodcast Interview\n\n\n\nCitus Data\n\n\nPodcast Interview\n\n\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability, it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and describes the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
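
One concrete example of the extensibility discussed here is Postgres acting as a document store: the sketch below uses psycopg2 to create a JSONB column, add a GIN index, and query with the containment operator. The connection string and table are placeholders.

import psycopg2

# Connection details are placeholders for a local test database.
conn = psycopg2.connect("dbname=demo user=postgres host=localhost")
cur = conn.cursor()

# A JSONB column plus a GIN index lets Postgres serve document-style queries.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id serial PRIMARY KEY,
        payload jsonb NOT NULL
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS events_payload_idx ON events USING gin (payload)")
cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
            ('{"type": "signup", "plan": "free"}',))

# The @> containment operator can use the GIN index to find matching documents.
cur.execute("SELECT id FROM events WHERE payload @> %s::jsonb", ('{"type": "signup"}',))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()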

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"A Whirlwind Tour Of The PostgreSQL Database (Interview)","date_published":"2018-08-06T01:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/725398d5-15ad-41bc-ba91-81b04b477869.mp3","mime_type":"audio/mpeg","size_in_bytes":60971550,"duration_in_seconds":3381}]},{"id":"podlove-2018-07-30t15:51:40+00:00-1181d5a60ee076c","title":"Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41","url":"https://www.dataengineeringpodcast.com/canopy-and-ona-with-peter-lubell-doughtie-episode-41","content_text":"Summary\n\nWith the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Ona and how did the company get started?\n\nWhat are some examples of the types of customers that you work with?\n\n\n\nWhat types of data do you support in your collection platform?\nWhat are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?\nDoes your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?\nWhat are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?\nCan you describe the flow of the data from collection through to analysis?\nTo help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?\n\n\nWhat are the architectural considerations that you factored in when designing it?\nWhat have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?\n\n\n\nWhat are your plans for the future of Ona and Canopy?\n\n\nContact Info\n\n\nEmail\npld on Github\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nOpenSRP\nOna\nCanopy\nOpen Data Kit\nEarth Institute at Columbia University\nSustainable Engineering Lab\nWHO\nBill and Melinda Gates Foundation\nXLSForms\nPostGIS\nKafka\nDruid\nSuperset\nPostgres\nAnsible\nDocker\nTerraform\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
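
A recurring theme with field-collected data is validating each submission before it is aggregated. The sketch below is a generic, hypothetical example of that kind of check in plain Python; the field names and rules are invented and do not reflect Ona's implementation.

# Hypothetical submission from a mobile data-collection form.
submission = {
    "household_id": "HH-0042",
    "children_under_5": 3,
    "latitude": -1.2921,
    "longitude": 36.8219,
}

REQUIRED = ["household_id", "children_under_5", "latitude", "longitude"]

def validate(record):
    # Return a list of problems; an empty list means the record can be aggregated.
    problems = ["missing field: " + field for field in REQUIRED if field not in record]
    if not -90 <= record.get("latitude", 0) <= 90:
        problems.append("latitude out of range")
    if not -180 <= record.get("longitude", 0) <= 180:
        problems.append("longitude out of range")
    if record.get("children_under_5", 0) < 0:
        problems.append("negative count of children")
    return problems

print(validate(submission))  # [] means the record is accepted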

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Collecting And Analysing Data At Human Scale With Ona And Canopy (Interview)","date_published":"2018-07-29T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/59df7eb3-a9e8-4f88-a85b-46e9186a9321.mp3","mime_type":"audio/mpeg","size_in_bytes":29581328,"duration_in_seconds":1754}]},{"id":"podlove-2018-07-16t02:12:22+00:00-f6728943f647dd7","title":"Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40","url":"https://www.dataengineeringpodcast.com/ceph-with-sage-weil-episode-40","content_text":"Summary\n\nWhen working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start with an overview of what Ceph is?\n\nWhat was the motivation for starting the project?\nWhat are some of the most common use cases for Ceph?\n\n\n\nThere are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. 
HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?\nGiven that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?\n\n\nWhat mechanisms are available to ensure data integrity across the cluster?\n\n\n\nHow is Ceph implemented and how has the design evolved over time?\nWhat is required to deploy and manage a Ceph cluster?\n\n\nWhat are the scaling factors for a cluster?\nWhat are the limitations?\n\n\n\nHow does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?\nIn services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?\nIn what situations would you advise someone against using Ceph?\nWhat are some of the most interested, unexpected, or challenging aspects of working with Ceph and the community?\nWhat are some of the plans that you have for the future of Ceph?\n\n\nContact Info\n\n\nEmail\n@liewegas on Twitter\nliewegas on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCeph\nRed Hat\nDreamHost\nUC Santa Cruz\nLos Alamos National Labs\nDream Objects\nOpenStack\nProxmox\nPOSIX\nGlusterFS\nHadoop\nCeph Architecture\nPaxos\nrelatime\nPrometheus\nZabbix\nKubernetes\nNVMe\nDNS-SD\nConsul\nEtcD\nDNS SRV Record\nZeroconf\nBluestore\nXFS\nErasure Coding\nNFS\nSeastar\nRook\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
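
Because Ceph's RADOS Gateway exposes an S3-compatible API, the object storage interface can be exercised with a standard S3 client. The sketch below uses boto3 with placeholder endpoint, credentials, and bucket names.

import boto3

# Point a standard S3 client at a Ceph RADOS Gateway endpoint (all values are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored in Ceph")

obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())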

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Using Ceph For Highly Available, Scalable, And Flexible File Storage (Interview)","date_published":"2018-07-15T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5e1d35b4-b3e4-4932-9743-1e808a7c4dc9.mp3","mime_type":"audio/mpeg","size_in_bytes":28307667,"duration_in_seconds":2910}]},{"id":"podlove-2018-07-08t21:41:04+00:00-1452c870dfe3f18","title":"Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39","url":"https://www.dataengineeringpodcast.com/nifi-with-kevin-doran-and-andy-lopresto-episode-39","content_text":"Summary\n\nData integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explained how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what NiFi is?\nWhat is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?\nHow did you get involved with the project?\n\nWhere does it sit in the broader landscape of data tools?\n\n\n\nDoes the data that is processed by NiFi flow through the servers that it is running on (á la Spark/Flink/Kafka), or does it orchestrate actions on other systems (á la Airflow/Oozie)?\n\n\nHow do you manage versioning and backup of data flows, as well as promoting them between environments?\n\n\n\nOne of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?\n\n\nWhat types of reporting are available across this information?\n\n\n\nWhat are some of the use cases or requirements that lend themselves well to being solved by NiFi?\n\n\nWhen is NiFi the wrong choice?\n\n\n\nWhat is involved in deploying and scaling a NiFi installation?\n\n\nWhat are some of the system/network parameters that should be considered?\nWhat are the scaling limitations?\n\n\n\nWhat have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?\nWhat do you have planned for the future of NiFi?\n\n\nContact Info\n\n\nKevin Doran\n\n@kevdoran on Twitter\nEmail\n\n\n\nAndy LoPresto\n\n\n@yolopey on Twitter\nEmail\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nNiFi\nHortonWorks DataFlow\nHortonWorks\nApache Software Foundation\nApple\nCSV\nXML\nJSON\nPerl\nPython\nInternet Scale\nAsset Management\nDocumentum\nDataFlow\nNSA (National Security Agency)\n24 (TV Show)\nTechnology Transfer Program\nAgile Software Development\nWaterfall\nSpark\nFlink\nKafka\nOozie\nLuigi\nAirflow\nFluentD\nETL (Extract, Transform, and Load)\nESB (Enterprise Service Bus)\nMiNiFi\nJava\nC++\nProvenance\nKubernetes\nApache Atlas\nData Governance\nKibana\nK-Nearest Neighbors\nDevOps\nDSL (Domain Specific Language)\nNiFi Registry\nArtifact Repository\nNexus\nNiFi CLI\nMaven Archetype\nIoT\nDocker\nBackpressure\nNiFi Wiki\nTLS (Transport Layer Security)\nMozilla TLS Observatory\nNiFi Flow Design System\nData Lineage\nGDPR (General Data Protection Regulation)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.
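
NiFi flows are assembled visually rather than written as code, but the underlying idea can be made concrete with a plain-Python analogue: receive records, route each one on an attribute, and hand it to a different destination, roughly what a flow built from a source processor, RouteOnAttribute, and a couple of sinks would do. This is a conceptual sketch, not NiFi's API.

# A plain-Python analogue of a simple routing flow; made-up records and destinations.
records = [
    {"type": "order", "id": 1},
    {"type": "refund", "id": 2},
    {"type": "order", "id": 3},
]

destinations = {"order": [], "refund": [], "unmatched": []}

for record in records:
    # Route each record by one of its attributes, falling back to an "unmatched" relationship.
    route = record.get("type", "unmatched")
    destinations.get(route, destinations["unmatched"]).append(record)

for route, batch in destinations.items():
    print(route, batch)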

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Self Service Data Flows With Apache NiFi (Interview)","date_published":"2018-07-08T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6eebfae3-4151-4d16-90ae-6154bbc0c5dd.mp3","mime_type":"audio/mpeg","size_in_bytes":48480013,"duration_in_seconds":3855}]},{"id":"podlove-2018-07-02t05:03:53+00:00-3b5fa253c04d24f","title":"Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38","url":"https://www.dataengineeringpodcast.com/alegion-with-cheryl-martin-episode-38","content_text":"Summary\n\nData is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nTo start, can you explain the problem space that Alegion is targeting and how you operate?\nWhen is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?\nWhat are some of the biggest challenges associated with managing human input to data sets intended for machine usage?\nFor someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like?\n\nWhat tools and processes do you have in place to ensure the accuracy of their inputs?\nHow do you prevent bad actors from contributing data that would compromise the trained model?\n\n\n\nWhat are the limitations of crowd-sourced data labels?\n\n\nWhen is it beneficial to incorporate domain experts in the process?\n\n\n\nWhen doing data collection from various sources, how do you ensure that intellectual property rights are respected?\nHow do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?\n\n\nWhat kinds of metadata do you track and how is that recorded/transmitted?\n\n\n\nDo you think that human intelligence will be a necessary piece of ML/AI forever?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nAlegion\nUniversity of Texas at Austin\nCognitive Science\nLabeled Data\nMechanical Turk\nComputer Vision\nSentiment Analysis\nSpeech Recognition\nTaxonomy\nFeature Engineering\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.
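
One common safeguard when labels come from many human annotators is to require agreement before a label is trusted. The sketch below shows a minimal majority-vote check with invented data; it is a generic illustration rather than Alegion's implementation.

from collections import Counter

# Labels collected for each item from three independent annotators (made-up data).
annotations = {
    "image_001.jpg": ["cat", "cat", "dog"],
    "image_002.jpg": ["dog", "dog", "dog"],
}

def consensus(labels, min_agreement=2):
    # Return the majority label, or None if annotators disagree too much.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

for image, labels in annotations.items():
    winner = consensus(labels)
    if winner is None:
        print(image, "-> send back for review by a domain expert")
    else:
        print(image, "->", winner)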

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Integrating Crowd Scale Human Intelligence In AI Projects (Interview)","date_published":"2018-07-02T01:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/112c8195-4e09-440f-92b2-7901190b2536.mp3","mime_type":"audio/mpeg","size_in_bytes":26961772,"duration_in_seconds":2773}]},{"id":"podlove-2018-06-25t02:26:24+00:00-8af85282e6849b0","title":"Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37","url":"https://www.dataengineeringpodcast.com/quilt-data-with-kevin-moore-episode-37","content_text":"Summary\n\nCollaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is the intended use case for Quilt and how did the project get started?\nCan you step through a typical workflow of someone using Quilt?\n\nHow does that change as you go from a single user to a team of data engineers and data scientists?\n\n\n\nCan you describe the elements of what a data package consists of?\n\n\nWhat was your criteria for the file formats that you chose?\n\n\n\nHow is Quilt architected and what have been the most significant changes or evolutions since you first started?\nHow is the data registry implemented?\n\n\nWhat are the limitations or edge cases that you have run into?\nWhat optimizations have you made to accelerate synchronization of the data to and from the repository?\n\n\n\nWhat are the limitations in terms of data volume, format, or usage?\nWhat is your goal with the business that you have built around the project?\nWhat are your plans for the future of Quilt?\n\n\nContact Info\n\n\nEmail\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nQuilt Data\nGitHub\nJobs\nReproducible Data Dependencies in Jupyter\nReproducible Machine Learning with Jupyter and Quilt\nAllen Institute: Programmatic Data Access with Quilt\nQuilt Example: MissingNo\nOracle\nPandas\nJupyter\nYcombinator\nData.World\n\nPodcast Episode with CTO Bryon Jacob\n\n\n\nKaggle\nParquet\nHDF5\nArrow\nPySpark\nExcel\nScala\nBinder\nMerkle Tree\nAllen Institute for Cell Science\nFlask\nPostGreSQL\nDocker\nAirflow\nQuilt Teams\nHive\nHive Metastore\nPrestoDB\n\n\nPodcast Episode\n\n\n\nNetflix Iceberg\nKubernetes\nHelm\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary


Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
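
For a sense of what working with a data package looked like around the time of this episode, the snippet below follows the Quilt 2 pattern of installing a named package and importing it as a Python object; the package name comes from Quilt's public examples, and the exact calls should be treated as approximate.

# Approximate usage of the Quilt 2 client from around 2018; exact node names vary by package.
import quilt

# Fetch a versioned data package from the public registry by name.
quilt.install("uciml/iris")

# Installed packages become importable Python objects whose nodes wrap the data files.
from quilt.data.uciml import iris

print(iris)  # inspect the package tree; leaf nodes load into pandas DataFrames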

Preamble

Interview

Contact Info

Parting Question

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"Quilt: The Package Manager And Repository For Your Data (Interview)","date_published":"2018-06-24T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/12fffccf-8bb9-4892-b07d-a243ea3264bd.mp3","mime_type":"audio/mpeg","size_in_bytes":22428631,"duration_in_seconds":2503}]},{"id":"podlove-2018-06-17t14:00:33+00:00-e451c6b0bdd6f51","title":"User Analytics In Depth At Heap with Dan Robinson - Episode 36","url":"https://www.dataengineeringpodcast.com/heap-with-dan-robinson-episode-36","content_text":"Summary\n\nWeb and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving a brief overview of Heap?\nOne of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?\nCan you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?\nData collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. 
How do you ensure the integrity and accuracy of that information?\n\nWhat are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?\n\n\n\nWhat is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?\n\n\nWhat challenges does that pose in your processing architecture?\n\n\n\nWhat are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?\n\n\nHow has that architecture changed or evolved over the life of the company?\nWhat are some changes that you are anticipating in the near future?\n\n\n\nCan you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?\nWhat are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?\nWhat changes have been necessary as a result of GDPR?\nWhat are your plans for the future of Heap?\n\n\nContact Info\n\n\n\n@danlovesproofs on twitter\ndan@drob.us\n@drob on github\nheapanalytics.com / @heap on twitter\nhttps://heapanalytics.com/blog/category/engineering?utm_source=rss&utm_medium=rss\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nHeap\nPalantir\nUser Analytics\nGoogle Analytics\nPiwik\nMixpanel\nHubspot\nJepsen\nChaos Engineering\nNode.js\nKafka\nScala\nCitus\nReact\nMobX\nRedshift\nHeap SQL\nBigQuery\nWebhooks\nDrip\nData Virtualization\nDNS\nPII\nSOC2\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is realizing that you haven’t been tracking a key interaction, then having to write custom logic to add that event and wait to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
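To make the retroactive-analysis idea concrete, here is a minimal Python sketch of the concept (not Heap's actual API or schema): every raw interaction is autocaptured up front, and an "event" becomes a named filter that can be defined long after the data was collected. All field and event names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class RawInteraction:
    # Hypothetical fields standing in for an autocaptured click or pageview record.
    action: str          # e.g. "click" or "pageview"
    css_selector: str    # e.g. "button#signup"
    path: str            # e.g. "/pricing"
    user_id: str

def define_event(predicate: Callable[[RawInteraction], bool]) -> Callable[[Iterable[RawInteraction]], int]:
    """A 'virtual event' is just a named predicate applied to data already collected."""
    def count(interactions: Iterable[RawInteraction]) -> int:
        return sum(1 for i in interactions if predicate(i))
    return count

# Defined weeks after the data was captured: no new tracking code, no waiting for data.
signup_clicks = define_event(lambda i: i.action == "click" and i.css_selector == "button#signup")

captured = [
    RawInteraction("click", "button#signup", "/pricing", "u1"),
    RawInteraction("pageview", "", "/pricing", "u2"),
]
print(signup_clicks(captured))  # -> 1
```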

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Heap's Data Infrastructure In Depth (Interview)","date_published":"2018-06-17T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e1800a0f-88cf-46a4-8e09-4fb7a55745d8.mp3","mime_type":"audio/mpeg","size_in_bytes":34314477,"duration_in_seconds":2727}]},{"id":"podlove-2018-06-11t01:48:37+00:00-fa0e6dedf43d653","title":"CockroachDB In Depth with Peter Mattis - Episode 35","url":"https://www.dataengineeringpodcast.com/cockroachdb-with-peter-mattis-episode-35","content_text":"Summary\n\nWith the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in Cockroach DB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was the motivation for creating CockroachDB and building a business around it?\nCan you describe the architecture of CockroachDB and how it supports distributed ACID transactions?\n\nWhat are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?\nWhat are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism?\n\n\n\nGo is an unconventional language for building a database. 
What are the pros and cons of that choice?\nWhat are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?\n\n\nWhat are the edge cases and failure modes that users should be aware of?\n\n\n\nI know that your SQL syntax is PostGreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB?\n\n\nWhat are some examples of extensions that are specific to CockroachDB?\n\n\n\nWhat are some of the most interesting uses of CockroachDB that you have seen?\nWhen is CockroachDB the wrong choice?\nWhat do you have planned for the future of CockroachDB?\n\n\nContact Info\n\n\nPeter\n\nLinkedIn\npetermattis on GitHub\n@petermattis on Twitter\n\n\n\nCockroach Labs\n\n\n@CockroackDB on Twitter\nWebsite\ncockroachdb on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCockroachDB\nCockroach Labs\nSQL\nGoogle Bigtable\nSpanner\nNoSQL\nRDBMS (Relational Database Management System)\n“Big Iron” (colloquial term for mainframe computers)\nRAFT Consensus Algorithm\nConsensus\nMVCC (Multiversion Concurrency Control)\nIsolation\nEtcd\nGDPR\nGolang\nC++\nGarbage Collection\nMetaprogramming\nRust\nStatic Linking\nDocker\nKubernetes\nCAP Theorem\nPostGreSQL\nORM (Object Relational Mapping)\nInformation Schema\nPG Catalog\nInterleaved Tables\nVertica\nSpark\nChange Data Capture\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
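Because CockroachDB speaks the PostgreSQL wire protocol, an ordinary Postgres driver can exercise its distributed ACID guarantees. The sketch below assumes the psycopg2 driver and a locally running node on the default port 26257; the database, table, and column names are hypothetical.

```python
import psycopg2

# Connect to a local CockroachDB node on its default SQL port; connection details,
# database, and table are placeholder values for this sketch.
conn = psycopg2.connect(
    host="localhost", port=26257, user="root", dbname="bank", sslmode="disable"
)

def transfer(conn, from_id, to_id, amount):
    # Both updates commit atomically or not at all, even when the rows live on
    # different nodes of the cluster.
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (amount, from_id),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (amount, to_id),
            )

transfer(conn, from_id=1, to_id=2, amount=100)
conn.close()
```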

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n

\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"","date_published":"2018-06-10T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/28ec5ffc-e2e6-47e8-b2c9-6fdd5ab81faa.mp3","mime_type":"audio/mpeg","size_in_bytes":30262358,"duration_in_seconds":2621}]},{"id":"podlove-2018-06-04t03:22:01+00:00-3094d37fdff0e40","title":"ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34","url":"https://www.dataengineeringpodcast.com/arangodb-fast-scalable-and-multi-model-data-storage-with-jan-steeman-and-jan-stucke-episode-34","content_text":"Summary\n\nUsing a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, dey/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you give a high level description of what ArangoDB is and the motivation for creating it?\n\nWhat is the story behind the name?\n\n\n\nHow is ArangoDB constructed?\n\n\nHow does the underlying engine store the data to allow for the different ways of viewing it?\n\n\n\nWhat are some of the benefits of multi-model data storage?\n\n\nWhen does it become problematic?\n\n\n\nFor users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?\nHow does it compare to OrientDB?\nWhat are the options for scaling a running system?\n\n\nWhat are the limitations in terms of network architecture or data volumes?\n\n\n\nOne of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. 
What benefits does that provide over a three tier architecture?\n\n\nWhat mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?\nWhat are some of the most interesting or surprising uses of this functionality that you have seen?\n\n\n\nWhat are some of the most challenging technical and business aspects of building and promoting ArangoDB?\nWhat do you have planned for the future of ArangoDB?\n\n\nContact Info\n\n\nJan Steemann\n\njsteemann on GitHub\n@steemann on Twitter\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nArangoDB\nKöln\nMulti-model Database\nGraph Algorithms\nApache 2\nC++\nArangoDB Foxx\nRaft Protocol\nTarget Partners\nRocksDB\nAQL (ArangoDB Query Language)\nOrientDB\nPostGreSQL\nOrientDB Studio\nGoogle Spanner\n3-Tier Architecture\nThomson-Reuters\nArango Search\nDell EMC\nGoogle S2 Index\nArangoDB Geographic Functionality\nJSON Schema\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
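A small illustration of what "multi-model" means in practice, assuming the python-arango driver (the episode does not prescribe a client): the same engine stores plain documents and graph edges, and one AQL query can traverse them. Collection names and credentials are placeholder values.

```python
from arango import ArangoClient

# Placeholder connection details for a local ArangoDB instance.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# A document collection and an edge collection live side by side in the same engine.
if not db.has_collection("people"):
    db.create_collection("people")
if not db.has_collection("knows"):
    db.create_collection("knows", edge=True)

db.collection("people").insert({"_key": "alice", "name": "Alice"})
db.collection("people").insert({"_key": "bob", "name": "Bob"})
db.collection("knows").insert({"_from": "people/alice", "_to": "people/bob"})

# The same records can be read as documents or traversed as a graph with one AQL query.
cursor = db.aql.execute("FOR v IN 1..1 OUTBOUND 'people/alice' knows RETURN v.name")
print(list(cursor))  # -> ['Bob']
```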

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n

\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Fast, Scalable, and Flexible Data Storage with ArangoDB (Interview)","date_published":"2018-06-03T23:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/171c3a59-3f12-4817-bfb8-dabfd676450f.mp3","mime_type":"audio/mpeg","size_in_bytes":30669129,"duration_in_seconds":2405}]},{"id":"podlove-2018-05-27t10:26:45+00:00-c4f1db2e56df9d0","title":"The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33","url":"https://www.dataengineeringpodcast.com/alooma-with-yair-weinberger-episode-33","content_text":"Summary\n\nBuilding an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Alooma and what is the origin story?\nHow is the Alooma platform architected?\n\nI want to go into stream VS batch here\nWhat are the most challenging components to scale?\n\n\n\nHow do you manage the underlying infrastructure to support your SLA of 5 nines?\nWhat are some of the complexities introduced by processing data from multiple customers with various compliance requirements?\n\n\nHow do you sandbox user’s processing code to avoid security exploits?\n\n\n\nWhat are some of the potential pitfalls for automatic schema management in the target database?\nGiven the large number of integrations, how do you maintain the\n\n\nWhat are some challenges when creating integrations, isn’t it simply conforming with an external API?\n\n\n\nFor someone getting started with Alooma what does the workflow look like?\nWhat are some of the most challenging aspects of building and maintaining Alooma?\nWhat are your plans for the future of Alooma?\n\n\nContact Info\n\n\nLinkedIn\n@yairwein on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nAlooma\nConvert Media\nData Integration\nESB (Enterprise Service Bus)\nTibco\nMulesoft\nETL (Extract, Transform, Load)\nInformatica\nMicrosoft SSIS\nOLAP Cube\nS3\nAzure Cloud Storage\nSnowflake DB\nRedshift\nBigQuery\nSalesforce\nHubspot\nZendesk\nSpark\nThe Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps\nRDBMS (Relational Database Management System)\nSaaS (Software as a Service)\nChange Data Capture\nKafka\nStorm\nGoogle Cloud PubSub\nAmazon Kinesis\nAlooma Code Engine\nZookeeper\nIdempotence\nKafka Streams\nKubernetes\nSOC2\nJython\nDocker\nPython\nJavascript\nRuby\nScala\nPII (Personally Identifiable Information)\nGDPR (General Data Protection Regulation)\nAmazon EMR (Elastic Map Reduce)\nSequoia Capital\nLightspeed Investors\nRedis\nAerospike\nCassandra\nMongoDB\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"The Alooma Cloud Data Pipeline Deep Dive (Interview)","date_published":"2018-05-27T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3bb7d627-af1b-4547-8234-59ea5679e873.mp3","mime_type":"audio/mpeg","size_in_bytes":34743742,"duration_in_seconds":2870}]},{"id":"podlove-2018-05-21t00:07:21+00:00-52148b622ee85a5","title":"PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32","url":"https://www.dataengineeringpodcast.com/prestodb-at-starburst-data-with-kamil-bajda-pawlikowski-episode-32","content_text":"Summary\n\nMost businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Presto is?\n\nWhat are some of the common use cases and deployment patterns for Presto?\n\n\n\nHow does Presto compare to Drill or Impala?\nWhat is it about Presto that led you to building a business around it?\nWhat are some of the most challenging aspects of running and scaling Presto?\nFor someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?\n\n\nHow does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?\n\n\n\nWhat are some cases in which Presto is not the right solution?\nWhat types of support have you found to be the most commonly requested?\nWhat are some of the types of tooling or improvements that you have made to Presto in your distribution?\n\n\nWhat are some of the notable changes that your team has contributed upstream to Presto?\n\n\n\n\n\nContact Info\n\n\nWebsite\nE-mail\nTwitter – @starburstdata\nTwitter – @prestodb\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nStarburst Data\nPresto\nHadapt\nHadoop\nHive\nTeradata\nPrestoCare\nCost Based Optimizer\nANSI SQL\nSpill To Disk\nTempto\nBenchto\nGeospatial 
Functions\nCassandra\nAccumulo\nKafka\nRedis\nPostGreSQL\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / [CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)","content_html":"

Summary

\n\n

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.
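A sketch of the federated-query idea, assuming the presto-python-client package: one SQL statement joins a table in a Hive catalog with a table in an operational PostgreSQL catalog, without copying either into a warehouse first. The coordinator host, catalogs, and table names are hypothetical.

```python
import prestodb

# Placeholder coordinator address; catalogs and tables are hypothetical.
conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT u.plan, count(*) AS pageviews
    FROM hive.web.pageviews AS p
    JOIN postgresql.public.users AS u ON p.user_id = u.id
    WHERE p.ds = '2018-05-20'
    GROUP BY u.plan
""")
for plan, pageviews in cur.fetchall():
    print(plan, pageviews)
```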

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / [CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)\"\"

","summary":"Analyzing Your Data Lake With PrestoDB (Interview)","date_published":"2018-05-20T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f3317bc1-ff2f-487a-b061-812c877a36b9.mp3","mime_type":"audio/mpeg","size_in_bytes":26228639,"duration_in_seconds":2527}]},{"id":"podlove-2018-05-14t00:32:22+00:00-e74035130daddf1","title":"Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31","url":"https://www.dataengineeringpodcast.com/odsc-east-2018-part-2-episode-31","content_text":"Summary\n\nThe Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He dscribes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to alow for fast and scalable operation.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.\n\n\nInterview\n\nAndy Eschbacher From Carto\n\n\nWhat are the challenges associated with storing geospatial data?\nWhat are some of the common misconceptions that people have about working with geospatial data?\n\n\nContact Info\n\n\nandy-esch on GitHub\n@MrEPhysics on Twitter\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCarto\nGeospatial Analysis\nGeoJSON\n\n\nTodd Blaschka From TigerGraph\n\n\nWhat are graph databases and how do they differ from relational engines?\nWhat are some of the common difficulties that people have when deling with graph algorithms?\nHow does data modeling for graph databases differ from relational stores?\n\n\nContact Info\n\n\nLinkedIn\n@toddblaschka on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTigerGraph\nGraph Databases\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

Andy Eschbacher From Carto

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

Todd Blaschka From TigerGraph

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"A Brief Look At Geospatial Data And Graph Databases (Interview)","date_published":"2018-05-13T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8f6d4025-42d4-4884-845e-2395ee6153ec.mp3","mime_type":"audio/mpeg","size_in_bytes":20368614,"duration_in_seconds":1565}]},{"id":"podlove-2018-05-07t01:39:15+00:00-a2cfe6962d8320d","title":"Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30","url":"https://www.dataengineeringpodcast.com/odsc-east-2018-part-1-episode-30","content_text":"Summary\n\nThe Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. 
Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.\n\n\nInterview\n\nAlan Anders from Applecart\n\n\nWhat are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?\nWhat are the biggest technical hurdles at Applecart?\n\n\nContact Info\n\n\n@alanjanders on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nSpark\nDataBricks\nDataBricks Delta\nApplecart\n\n\nStepan Pushkarev from Hydrosphere.io\n\n\nWhat is Hydropshere.io?\nWhat metrics do you track to determine when a machine learning model is not producing an appropriate output?\nHow do you determine which data points to sample for retraining the model?\nHow does the role of a machine learning engineer differ from data engineers and data scientists?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nHydrosphere\nMachine Learning Engineer\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
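One common way to decide when a production model needs to be retrained is to compare the distribution of recent inputs or predictions against the training baseline. The sketch below shows that general idea with a Kolmogorov–Smirnov test; it illustrates the technique only, not Hydrosphere.io's implementation, and the data is synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(baseline, production_sample, alpha=0.01):
    """A small p-value means the two samples likely come from different
    distributions, i.e. the model is now seeing unfamiliar data."""
    _, p_value = ks_2samp(baseline, production_sample)
    return p_value < alpha

baseline = np.random.normal(0.0, 1.0, size=5_000)    # scores observed during training
production = np.random.normal(0.6, 1.0, size=1_000)  # scores sampled from production traffic
if drift_detected(baseline, production):
    print("Distribution shift detected: sample production data, retrain, and redeploy.")
```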

\n\n

Preamble

\n\n\n\n

Interview

\n\n

Alan Anders from Applecart

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

Stepan Pushkarev from Hydrosphere.io

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Brief Conversations On Data Engineering From The Open Data Science Conference","date_published":"2018-05-06T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d184cf3e-cf15-4a19-a888-2783ee196c21.mp3","mime_type":"audio/mpeg","size_in_bytes":25280186,"duration_in_seconds":1958}]},{"id":"podlove-2018-04-30t01:02:56+00:00-fa79234da01d7a6","title":"Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29","url":"https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29","content_text":"Summary\n\nBusiness Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organizations data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nThe current goal for most companies is to be “data driven”. 
How would you define that concept?\n\nHow does Metabase assist in that endeavor?\n\n\n\nWhat is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?\n\n\nWhat level of complexity is possible with the query builder?\n\n\n\nWhat have you found to be the typical use cases for Metabase in the context of an organization?\nHow do you manage scaling for large or complex queries?\nWhat was the motivation for using Clojure as the language for implementing Metabase?\nWhat is involved in adding support for a new data source?\nWhat are the differentiating features of Metabase that would lead someone to choose it for their organization?\nWhat have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?\nWhat do you have planned for the future of Metabase?\n\n\nContact Info\n\n\nSameer\n\nsalsakran on GitHub\n@sameer_alsakran on Twitter\nLinkedIn\n\n\n\nMetabase\n\n\nWebsite\n@metabase on Twitter\nmetabase on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nExpa\nMetabase\nBlackjet\nHadoop\nImeem\nMaslow’s Hierarchy of Data Needs\n2 Sided Marketplace\nHoneycomb Interview\nExcel\nTableau\nGo-JEK\nClojure\nReact\nPython\nScala\nJVM\nRedash\nHow To Lie With Data\nStripe\nBraintree Payments\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization’s data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n

\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Self Service Business Intelligence For Everyone With Metabase (Interview)","date_published":"2018-04-29T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/712d9794-7d42-4251-9769-c55d5ee98c3e.mp3","mime_type":"audio/mpeg","size_in_bytes":34587427,"duration_in_seconds":2686}]},{"id":"podlove-2018-04-22t02:54:01+00:00-f106327a9fd1407","title":"Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28","url":"https://www.dataengineeringpodcast.com/octopai-with-amnon-drori-episode-28","content_text":"Summary\n\nThe information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is OctopAI and what was your motivation for founding it?\nWhat are some of the types of information that you classify and collect as metadata?\nCan you talk through the architecture of your platform?\nWhat are some of the challenges that are typically faced by metadata management systems?\nWhat is involved in deploying your metadata collection agents?\nOnce the metadata has been collected what are some of the ways in which it can be used?\nWhat mechanisms do you use to ensure that customer data is segregated?\n\nHow do you identify and handle sensitive information during the collection step?\n\n\n\nWhat are some of the most challenging aspects of your technical and business platforms that you have faced?\nWhat are some of the plans that you have for OctopAI going forward?\n\n\nContact Info\n\n\nAmnon\n\nLinkedIn\n@octopai_amnon on Twitter\n\n\n\nOctopAI\n\n\n@OctopaiBI on Twitter\nWebsite\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nOctopAI\nMetadata\nMetadata Management\nData Integrity\nCRM (Customer Relationship Management)\nERP (Enterprise Resource Planning)\nBusiness Intelligence\nETL (Extract, Transform, Load)\nInformatica\nSAP\nData Governance\nSSIS (SQL Server Integration Services)\nVertica\nAirflow\nLuigi\nOozie\nGDPR (General Data Privacy Regulation)\nRoot Cause Analysis\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n

\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Octopai Managed Metadata Service For Better Business Intelligence (Interview)","date_published":"2018-04-22T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7e2a04d9-8dc1-4426-8d70-b4832d3568f4.mp3","mime_type":"audio/mpeg","size_in_bytes":23520376,"duration_in_seconds":2392}]},{"id":"podlove-2018-04-15t03:10:47+00:00-4eb35f774e35852","title":"Data Engineering Weekly with Joe Crobak - Episode 27","url":"https://www.dataengineeringpodcast.com/data-engineering-weekly-with-joe-crobak-episode-27","content_text":"Summary\n\nThe rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsleteter, and the insights that he has gleaned from it.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are some of the projects that you have been involved in that were most personally fulfilling?\n\nAs an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data?\nHealthcare.gov has a storied history, how did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to?\n\n\n\nWhat was your motivation for starting a newsletter about the Hadoop space?\n\n\nCan you speak to your reasoning for the recent rebranding of the newsletter?\n\n\n\nHow much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?\nAfter over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?\n\n\nWhat have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?\n\n\n\nWhat is your workflow for finding and curating the content that goes into your newsletter?\nWhat is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?\nHow has your experience managing the newsletter influenced 
your areas of focus in your work and vice-versa?\nWhat are your plans going forward?\n\n\nContact Info\n\n\nData Eng Weekly\nEmail\nTwitter – @joecrobak\nTwitter – @dataengweekly\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nUSDS\nNational Labs\nCray\nAmazon EMR (Elastic Map-Reduce)\nRecommendation Engine\nNetflix Prize\nHadoop\nCloudera\nPuppet\nhealthcare.gov\nMedicare\nQuality Payment Program\nHIPAA\nNIST National Institute of Standards and Technology\nPII (Personally Identifiable Information)\nThreat Modeling\nApache JBoss\nApache Web Server\nMarkLogic\nJMS (Java Message Service)\nLoad Balancer\nCOBOL\nHadoop Weekly\nData Engineering Weekly\nFoursquare\nNiFi\nKubernetes\nSpark\nFlink\nStream Processing\nDataStax\nRSS\nThe Flavors of Data Science and Engineering\nCQRS\nChange Data Capture\nJay Kreps\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After becoming engrossed in researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Keeping Up With The Data Engineering industry (Interview)","date_published":"2018-04-14T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c6ab869d-1981-4020-8553-15cdc7ebd1fb.mp3","mime_type":"audio/mpeg","size_in_bytes":29824154,"duration_in_seconds":2612}]},{"id":"podlove-2018-04-08t21:19:27+00:00-aa15772fb2a12ec","title":"Defining DataOps with Chris Bergh - Episode 26","url":"https://www.dataengineeringpodcast.com/datakitchen-dataops-with-chris-bergh-episode-26","content_text":"Summary\n\nManaging an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nHow do you define DataOps?\n\nHow does it compare to the practices encouraged by the DevOps movement?\nHow does it relate to or influence the role of a data engineer?\n\n\n\nHow does a DataOps oriented workflow differ from other existing approaches for building data platforms?\nOne of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments?\nThe practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system?\nOne of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. 
What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal?\n\n\nIn order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments?\n\n\n\nHow does the DataKitchen platform simplify the process of operationalizing a data analytics workflow?\nAs the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed?\n\n\nContact Info\n\n\nLinkedIn\n@ChrisBergh on Twitter\nEmail\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDataOps Manifesto\nDataKitchen\n2017: The Year Of DataOps\nAir Traffic Control\nChief Data Officer (CDO)\nGartner\nW. Edwards Deming\nDevOps\nTotal Quality Management (TQM)\nInformatica\nTalend\nAgile Development\nCattle Not Pets\nIDE (Integrated Development Environment)\nTableau\nDelphix\nDremio\nPachyderm\nContinuous Delivery by Jez Humble and Dave Farley\nSLAs (Service Level Agreements)\nXKCD Image Recognition Comic\nAirflow\nLuigi\nDataKitchen Documentation\nContinuous Integration\nContinous Delivery\nDocker\nVersion Control\nGit\nLooker\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.
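A core DataOps practice is testing the data itself the way you would unit test code: assert expectations about each batch before promoting it to the next stage. The following is a generic sketch of such a check using pandas, not DataKitchen's product; the dataset and column names are hypothetical.

```python
from typing import List
import pandas as pd

def validate_orders(df: pd.DataFrame) -> List[str]:
    """Return a list of failed expectations for a batch of order records."""
    failures = []
    if df.empty:
        failures.append("no rows received")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        failures.append("negative order amounts")
    if df["customer_id"].isna().any():
        failures.append("orders missing customer_id")
    return failures

batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": ["a", None, "c"],
    "amount": [10.0, -5.0, 3.5],
})
problems = validate_orders(batch)
if problems:
    raise ValueError(f"Data tests failed, halting pipeline: {problems}")
```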

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Building Better Analytics Using DataOps (Interview)","date_published":"2018-04-08T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/aad5622b-be29-4965-9048-e61c78d064dc.mp3","mime_type":"audio/mpeg","size_in_bytes":40484381,"duration_in_seconds":3270}]},{"id":"podlove-2018-04-01t13:04:23+00:00-798a607d34e4b0a","title":"ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25","url":"https://www.dataengineeringpodcast.com/threatstack-with-pete-cheslock-and-patrick-cable-episode-25","content_text":"Summary\n\nCloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhy don’t you start by explaining what ThreatStack does?\n\nWhat was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?\n\n\n\nCan you describe the type(s) of data that you collect and how it is structured?\nWhat is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?\n\n\nHow do you ensure a consistent format of the information that you receive?\nHow do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?\nHow much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?\n\n\n\nI understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?\n\n\nHow much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?\n\n\n\nHow do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?\nWhat are some of the most common vulnerabilities that you detect in your client’s infrastructure?\nFor someone who wants to start using ThreatStack, what does the setup process look like?\nWhat have you found to be the most challenging aspects of building and managing the data processes in your environment?\nWhat are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?\n\n\nContact Info\n\n\nPete Cheslock\n\n@petecheslock on Twitter\nWebsite\npetecheslock on GitHub\n\n\n\nPatrick Cable\n\n\n@patcable on Twitter\nWebsite\npatcable on GitHub\n\n\n\nThreatStack\n\n\nWebsite\n@threatstack on Twitter\nthreatstack on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nThreatStack\nSecDevOps\nSonian\nEC2\nSnort\nSnorby\nSuricata\nTripwire\nSyscall (System Call)\nAuditD\nCloudTrail\nNaxsi\nCloud Native\nFile Integrity Monitoring (FIM)\nAmazon Web Services (AWS)\nRabbitMQ\nZeroMQ\nKafka\nSpark\nSlack\nPagerDuty\nJSON\nMicroservices\nCassandra\nElasticSearch\nSensu\nService Discovery\nHoneypot\nKubernetes\nPostGreSQL\nDruid\nFlink\nLaunch Darkly\nChef\nConsul\nTerraform\nCloudFormation\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Using Anomaly Detection To Secure Your Cloud with ThreatStack (Interview)","date_published":"2018-04-01T16:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/216f38c6-3c2b-4af2-bc9b-217d614c5571.mp3","mime_type":"audio/mpeg","size_in_bytes":35018801,"duration_in_seconds":3112}]},{"id":"podlove-2018-03-25t17:38:23+00:00-9f1f29a905e9bbb","title":"MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24","url":"https://www.dataengineeringpodcast.com/marketstore-with-hitoshi-harada-and-christopher-ryan-episode-24","content_text":"Summary\n\nThe data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alapaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alapaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was your motivation for creating MarketStore?\nWhat are the characteristics of financial time series data that make it challenging to manage?\nWhat are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available?\nWith MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?\n\nWhat is the worst case scenario if there is a total failure in the data store?\nWhat guards have you built to prevent such a situation from occurring?\n\n\n\nSince MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable?\nWhat were the most challenging aspects of building MarketStore and integrating it into the rest of your systems?\nMotivation for open sourcing the code?\nWhat is the next planned major feature for MarketStore, and what use-case is it aiming to support?\n\n\nContact Info\n\n\nChristopher\n\nEmail\n\n\n\nHitoshi\n\n\nEmail\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nMarketStore\n\nGitHub\nRelease Announcement\n\n\n\nAlpaca\nIBM\nDB2\nGreenPlum\nAlgorithmic Trading\nBacktesting\nOHLC (Open-High-Low-Close)\nHDF5\nGolang\nC++\nTimeseries Database List\nInfluxDB\nJSONRPC\nSlait\nCircleCI\nGDAX\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Fast and Scalable Financial Timeseries Dataframes with MarketStore (Interview)","date_published":"2018-03-25T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1951913b-6d96-4887-a3c4-5ccd80284599.mp3","mime_type":"audio/mpeg","size_in_bytes":23889511,"duration_in_seconds":2007}]},{"id":"podlove-2018-03-19t01:27:20+00:00-ff36314798bd73d","title":"Stretching The Elastic Stack with Philipp Krenn - Episode 23","url":"https://www.dataengineeringpodcast.com/elastic-stack-with-philipp-krenn-episode-23","content_text":"Summary\n\nSearch is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the proces. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nThe Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?\nBeyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?\nWhat are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?\nWhat do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?\nWhat are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise?\nWhat are the biggest challenges facing Elastic as a company in the near to medium term?\nOpen source as a business model: https://www.elastic.co/blog/doubling-down-on-open?utm_source=rss&utm_medium=rss\nWhat is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?\n\n\nContact Info\n\n\n@xeraa on Twitter\nxeraa on GitHub\nWebsite\nEmail\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nElastic\nVienna – Capital of Austria\nWhat Is Developer Advocacy?\nNoSQL\nMongoDB\nElasticsearch\nCassandra\nNeo4J\nHazelcast\nApache Lucene\nLogstash\nKibana\nBeats\nX-Pack\nELK Stack\nMetrics\nAPM (Application Performance Monitoring)\nGeoJSON\nSplit Brain\nElasticsearch Ingest Nodes\nPacketBeat\nElastic Cloud\nElasticon\nKibana Canvas\nSwiftType\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Exploring The Elastic Stack: From Text Search To Metrics Platform (Interview)","date_published":"2018-03-18T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/920cad85-4c9c-4f4d-a341-6e9ec27edc0e.mp3","mime_type":"audio/mpeg","size_in_bytes":40162826,"duration_in_seconds":3062}]},{"id":"podlove-2018-03-12t04:04:07+00:00-776468806187efe","title":"Database Refactoring Patterns with Pramod Sadalage - Episode 22","url":"https://www.dataengineeringpodcast.com/database-refactoring-patterns-with-pramod-sadalage-episode-22","content_text":"Summary\n\nAs software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYou first co-authored Refactoring Databases in 2006. 
What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject?\nWhat are the characteristics of a database that make them more difficult to manage in an iterative context?\nHow does the practice of refactoring in the context of a database compare to that of software?\nHow has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution?\nIs there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system?\nHow has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution?\nWhat have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system?\nLooking back over the past 12 years, what has changed in the areas of database design and evolution?\n\nHow has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases?\nWhat do you see as the biggest challenges facing us over the next few years?\n\n\n\n\n\nContact Info\n\n\nWebsite\npramodsadalage on GitHub\n@pramodsadalage on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDatabase Refactoring\n\nWebsite\nBook\n\n\n\nThoughtworks\nMartin Fowler\nAgile Software Development\nXP (Extreme Programming)\nContinuous Integration\n\n\nThe Book\nWikipedia\n\n\n\nTest First Development\nDDL (Data Definition Language)\nDML (Data Modification Language)\nDevOps\nFlyway\nLiquibase\nDBMaintain\nHibernate\nSQLAlchemy\nORM (Object Relational Mapper)\nODM (Object Document Mapper)\nNoSQL\nDocument Database\nMongoDB\nOrientDB\nCouchBase\nCassandraDB\nNeo4j\nArangoDB\nUnit Testing\nIntegration Testing\nOLAP (On-Line Analytical Processing)\nOLTP (On-Line Transaction Processing)\nData Warehouse\nDocker\nQA==Quality Assurance\nHIPAA (Health Insurance Portability and Accountability Act)\nPCI DSS (Payment Card Industry Data Security Standard)\nPolyglot Persistence\nToplink Java ORM\nRuby on Rails\nActiveRecord Gem\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Evolutionary Database Design and Refactoring (Interview)","date_published":"2018-03-12T00:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a7d6c3b1-a674-43f8-962f-9fd91b944e64.mp3","mime_type":"audio/mpeg","size_in_bytes":35393524,"duration_in_seconds":2945}]},{"id":"podlove-2018-03-05t02:24:19+00:00-b345c9d66278a00","title":"The Future Data Economy with Roger Chen - Episode 21","url":"https://www.dataengineeringpodcast.com/data-economy-with-roger-chen-episode-21","content_text":"Summary\n\nData is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor for data driven businesses Roger Chen is well versed in the challenges and solutions being facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYou wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. 
Can you explain what you mean by that term?\nWhat are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?\nCan you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?\nMany companies view their data as a strategic asset and are therefore loathe to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?\nWhat kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?\nWhat do you view as being the privacy implications from creating and sharing these larger pools of data inventory?\nWhat do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization?\nWith broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?\n\n\nContact Info\n\n\n@rgrchen on Twitter\nLinkedIn\nAngel List\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nElectrical Engineering\nBerkeley\nSilicon Nanophotonics\nData Liquidity In The Age Of Inference\nData Silos\nExample of a Data Commons Cooperative\nGoogle Maps Moat: An article describing how Google Maps has refined raw data to create a new product\nGenomics\nPhenomics\nImageNet\nOpen Data\nData Brokerage\nSmart Contracts\nIPFS\nDat Protocol\nHomomorphic Encryption\nFileCoin\nData Programming\nSnorkel\n\nWebsite\nPodcast Interview\n\n\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Data Liquidity In The AI Economy (Interview)","date_published":"2018-03-04T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/872a6d49-12c4-40bd-b432-3de6551e1235.mp3","mime_type":"audio/mpeg","size_in_bytes":31343138,"duration_in_seconds":2567}]},{"id":"podlove-2018-02-26t03:57:29+00:00-32845a6df9b88aa","title":"Honeycomb Data Infrastructure with Sam Stokes - Episode 20","url":"https://www.dataengineeringpodcast.com/honeycomb-data-infrastructure-with-sam-stokes-episode-20","content_text":"Summary\n\nOne of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. 
To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Honeycomb and how did you get started at the company?\nCan you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph?\nWhat are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale?\nIn addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns?\nA high degree of visibility into a running system is desirable for developers and systems administrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for users?\nHow does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool?\nWhat have been some of the most challenging aspects of building, scaling, and marketing Honeycomb?\n\n\nContact Info\n\n\n@samstokes on Twitter\nBlog\nsamstokes on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nHoneycomb\nRetriever\nMonitoring and Observability\nKafka\nColumn Oriented Storage\nElasticsearch\nElastic Stack\nDjango\nRuby on Rails\nHeroku\nKubernetes\nLaunch Darkly\nSplunk\nDatadog\nCynefin Framework\nGo-Lang\nTerraform\nAWS\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Event Data Infrastructure at Honeycomb.io (Interview)","date_published":"2018-02-25T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6956dc44-b20e-47de-815c-d1c7ef7b919c.mp3","mime_type":"audio/mpeg","size_in_bytes":31455429,"duration_in_seconds":2493}]},{"id":"podlove-2018-02-19t00:58:11+00:00-391ba93d4336b97","title":"Data Teams with Will McGinnis - Episode 19","url":"https://www.dataengineeringpodcast.com/data-teams-with-will-mcginnis-episode-19","content_text":"Summary\n\nThe responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nThe terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. 
Can you share how you define those terms?\nWhat parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?\nIs there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?\nWhat are the benefits of splitting the responsibilities of data engineering and data science?\n\nWhat are the disadvantages?\n\n\n\nWhat are some strategies to ensure successful interaction between data engineers and data scientists?\nHow do you view these roles evolving as they become more prevalent across companies and industries?\n\n\nContact Info\n\n\nWebsite\nwdm0006 on GitHub\n@willmcginniser on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nBlog Post: Tendencies of Data Engineers and Data Scientists\nPredikto\nCategorical Encoders\nDevOps\nSciKit-Learn\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"How Data Teams Work Together (Interview)","date_published":"2018-02-18T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b0214a23-8a42-4afa-adcf-2678ac13c1d9.mp3","mime_type":"audio/mpeg","size_in_bytes":23195336,"duration_in_seconds":1718}]},{"id":"podlove-2018-02-11t15:57:44+00:00-4674d6e85b6a857","title":"TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18","url":"https://www.dataengineeringpodcast.com/timescaledb-with-ajay-kulkarni-and-mike-freedman-episode-18","content_text":"Summary\n\nAs communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Timescale is and how the project got started?\nThe landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?\nIn your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. 
How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?\nHow is Timescale implemented and how has the internal architecture evolved since you first started working on it?\n\nWhat impact has the 10.0 release of PostGreSQL had on the design of the project?\nIs timescale compatible with systems such as Amazon RDS or Google Cloud SQL?\n\n\n\nFor someone who wants to start using Timescale what is involved in deploying and maintaining it?\nWhat are the axes for scaling Timescale and what are the points where that scalability breaks down?\n\n\nAre you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?\n\n\n\nWhat has been the most challenging aspect of building and marketing Timescale?\nWhen is Timescale the wrong tool to use for time series data?\nOne of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?\nWhat are some of the most interesting uses of Timescale that you have seen?\nWhich came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?\nWhat features or improvements do you have planned for future releases of Timescale?\n\n\nContact Info\n\n\nAjay\n\nLinkedIn\n@acoustik on Twitter\nTimescale Blog\n\n\n\nMike\n\n\nWebsite\nLinkedIn\n@michaelfreedman on Twitter\nTimescale Blog\n\n\n\nTimescale\n\n\nWebsite\n@timescaledb on Twitter\nGitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTimescale\nPostGreSQL\nCitus\nTimescale Design Blog Post\nMIT\nNYU\nStanford\nSDN\nPrinceton\nMachine Data\nTimeseries Data\nList of Timeseries Databases\nNoSQL\nOnline Transaction Processing (OLTP)\nObject Relational Mapper (ORM)\nGrafana\nTableau\nKafka\nWhen Boring Is Awesome\nPostGreSQL\nRDS\nGoogle Cloud SQL\nAzure DB\nDocker\nContinuous Aggregates\nStreaming Replication\nPGPool II\nKubernetes\nDocker Swarm\nCitus Data\n\nWebsite\nData Engineering Podcast Interview\n\n\n\nDatabase Indexing\nB-Tree Index\nGIN Index\nGIST Index\nSTE Energy\nRedis\nGraphite\nPrometheus\npg_prometheus\nOpenMetrics Standard Proposal\nTimescale Parallel Copy\nHadoop\nPostGIS\nKDB+\nDevOps\nInternet of Things\nMongoDB\nElastic\nDataBricks\nApache Spark\nConfluent\nNew Enterprise Associates\nMapD\nBenchmark Ventures\nHortonworks\n2σ Ventures\nCockroachDB\nCloudflare\nEMC\nTimescale Blog: Why SQL is beating NoSQL, and what this means for the future of data\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"TimescaleDB: Fast and Scalable Timeseries On PostGreSQL (Interview)","date_published":"2018-02-11T11:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/78092640-4638-4558-bd2b-a64c02d37f3a.mp3","mime_type":"audio/mpeg","size_in_bytes":49150487,"duration_in_seconds":3760}]},{"id":"podlove-2018-02-04t03:19:42+00:00-7126415d91bf89d","title":"Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17","url":"https://www.dataengineeringpodcast.com/pulsar-fast-and-scalable-messaging-with-rajan-dhabalia-and-matteo-merli-episode-17","content_text":"Summary\n\nOne of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. 
To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Pulsar is and what the original inspiration for the project was?\nWhat have been some of the most challenging aspects of building and promoting Pulsar?\nFor someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?\nWhat are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?\nWhat projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?\nThe documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?\nOne of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?\nWhen is Pulsar the wrong tool to use?\nWhat are some of the improvements or new features that you have planned for the future of Pulsar?\n\n\nContact Info\n\n\nMatteo\n\nmerlimat on GitHub\n@merlimat on Twitter\n\n\n\nRajan\n\n\n@dhabaliaraj on Twitter\nrhabalia on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPulsar\nPublish-Subscribe\nYahoo\nStreamlio\nActiveMQ\nKafka\nBookkeeper\nSLA (Service Level Agreement)\nWrite-Ahead Log\nAnsible\nZookeeper\nPulsar Deployment Instructions\nRabbitMQ\nConfluent Schema Registry\n\nPodcast Interview\n\n\n\nKafka Connect\nWallaroo\n\n\nPodcast Interview\n\n\n\nKinesis\nAthenz\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

","summary":"Fast, Globally Scalable Data Streaming with Pulsar (Interview)","date_published":"2018-02-03T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/290d7faf-865c-4170-a5c1-b0df56e2acc0.mp3","mime_type":"audio/mpeg","size_in_bytes":37072519,"duration_in_seconds":3226}]},{"id":"podlove-2018-01-29t02:19:14+00:00-646dfc6a5756548","title":"Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16","url":"https://www.dataengineeringpodcast.com/dat-with-danielle-robinson-and-joe-hand-episode-16","content_text":"Summary\nSharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. 
To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\nYour host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is the Dat project and how did it get started?\nHow have the grants to the Dat project influenced the focus and pace of development that was possible?\n\nNow that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?\n\n\nCan you explain how the Dat protocol is designed and how it has evolved since it was first started?\nHow does Dat manage conflict resolution and data versioning when replicating between multiple machines?\nOne of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions?\nOne of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made?\nHow does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via dat, vs the common three-tier architecture oriented around persistent databases?\nWhat mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?\nFor someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?\nWhat have been the most challenging aspects of building and promoting Dat?\nWhat are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?\n\nContact Info\n\nDat\n\ndatproject.org\nEmail\n@dat_project on Twitter\nDat Chat\n\n\nDanielle\n\nEmail\n@daniellecrobins\n\n\nJoe\n\nEmail\n@joeahand on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDat Project\nCode For Science and Society\nNeuroscience\nCell Biology\nOpenCon\nMozilla Science\nOpen Education\nOpen Access\nOpen Data\nFortune 500\nData Warehouse\nKnight Foundation\nAlfred P. Sloan Foundation\nGordon and Betty Moore Foundation\nDat In The Lab\nDat in the Lab blog posts\nCalifornia Digital Library\nIPFS\nDat on Open Collective – COMING SOON!\nScienceFair\nStencila\neLIFE\nGit\nBitTorrent\nDat Whitepaper\nMerkle Tree\nCertificate Transparency\nDat Protocol Working Group\nDat Multiwriter Development – Hyperdb\nBeaker Browser\nWebRTC\nIndexedDB\nRust\nC\nKeybase\nPGP\nWire\nZenodo\nDryad Data Sharing\nDataverse\nRSync\nFTP\nGlobus\nFritter\nFritter Demo\nRotonde how to\nJoe’s website on Dat\nDat Tutorial\nData Rescue – NYTimes Coverage\nData.gov\nLibraries+ Network\nUC Conservation Genomics Consortium\nFair Data principles\nhypervision\nhypervision in browser\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n Click here to read the unedited transcript…\n Tobias Macey 00:13\n Hello and welcome to the data engineering podcast the show about modern data management. 
When you’re ready to launch your next project, you’ll need somewhere to deploy it, you should check out Linotype data engineering podcast.com slash load and get a $20 credit to try out there fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to date engineering podcast com to subscribe to the show. Sign up for the newsletter read the show notes and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and co workers and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly strata conference in San Jose, California how from March 5 to the eighth. Use the link data engineering podcast.com slash strata dash San Jose to register and save 20% off your tickets. The O’Reilly AI conference is also coming up happening April 29. To the 30th. In New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to data engineering podcast.com slash AI con dash new dash York to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1 through the fourth. It has become one of the largest events for data scientists, data engineers and data driven businesses to get together and learn how to be more effective. To save 60% of your tickets go to data engineering podcast.com slash o d s c dash East dash 2018 and register. Your host is Tobias Macey. And today I’m interviewing Danielle Robinson and Joe hand about the DAP project the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure.\n Danielle Robinson 02:10\n My name is Danielle Robinson. And I’m the CO executive director of code for science and society, which is the nonprofit that supports that project. I’ve been working on debt related projects first as a partnerships director for about a year now. And I’m here with my colleague, Joe hand, take it away, Joe.\n Joe Hand 02:32\n Joe hand and I’m the other co executive director and the director of operations at code for science and society. And I’ve been a core contributor for about two years now.\n Tobias Macey 02:42\n And Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure.\n Danielle Robinson 02:48\n So I have a PhD in neuroscience. I finished that about a year and a half ago. And what I did during my PhD, my research was focused on cell biology Gee, really, without getting into the weeds too much on that a lot of time microscopes collecting some kind of medium sized aging data. And during that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting the access of access of people to the results of taxpayer funded research. So publications are behind paywalls. And data is either not published along with the paper or sometimes is published but not well archived and becomes inaccessible over time. So sort of compounding this traditionally, code has not really been thought of as an academic, a scholarly work. So that’s a whole nother conversation. 
But even though these things are changing data and code aren’t shared consistently, and are pretty inconsistently managed within labs, I think that’s fair to say. So and what that does is it makes it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the open con and Mozilla science communities, which I encourage your listeners to check out. These communities build inter interdisciplinary connections between the open source world and open education, open access and open data communities. And that’s really important to like build things that people will actually use and make big cultural and policy changes that will make it easier to access research and share data. So it sort of I got involved, because of the partly because of the technical challenge. But also I’m interested in the people problems. So the changes to the incentive structure and the culture of research that are needed to make data management better on a day to day and make our research infrastructure stronger and more long lasting.\n Tobias Macey 04:54\n And Joe, how did you get involved in data management?\n Joe Hand 04:57\n Yeah, I’ve sort of gone back and forth between the sort of more academic or research a management and more traditional software side. So I really got started involved in data management when I was at a data visualization agency. And we basically built, you know, pretty web based visualization, interactive visualizations, for variety clients. This was cool, because it sort of allowed me to see like a large variety of data management techniques. So there was like the small scale, spreadsheet and manually updating data and spreadsheets, and then sending that off to visualize and to like, big fortune 500 companies that had data warehouses and full internal API’s that we got access to. So it’s really cool to see that sort of variety of, of data collection and data usage between all those organizations. So that was also good, because it, it sort of helped me understand how how to use data effectively. And that really means like telling a story around it. So you know, in order to sort of use data, you have to either use some math or some visual representation and the best the best stories around data combined, sort of bit of both of those. And then from there, I moved to a Research Institute. And we were tasked with building a data platform for international NGO. And they that group basically does census data collection in slums all over the world. And so as a research group, we were sort of trying interested in in using that data for research, but we also had to help them figure out how to collect that data. So before we came in with that project, they’d basically doing 30 years of data collection on paper, and then simulate sometimes manually entering that data into spreadsheets, and then trying to sort of share that around through thumb drives or Dropbox or sort of whatever tools they had access to. So this was cool, because it really gave me a great opportunity to see the other side of data management and analysis. So, you know, we work with the corporate clients, which sort of have big, lots of resources and computer computer resources and cloud servers. And this was sort of the other side where there’s, there’s very few resources, most of the data analysis happens offline. And a lot of the data transfer happens offline. 
So it was really interesting to see that a lot of the tools I'd been taking for granted couldn't be applied in those areas. And then on the research side of things, I saw that scientists and governments were organizing data just as haphazardly. I was trying to collect and download census data from about 30 countries, and we had to email and fax people, and we got different CDs and paper documents and PDFs in other languages. So that really illustrated that there's a lot of data being managed out there in ways I wasn't totally familiar with, and it's striking how everybody manages their data differently. That's what I like to call the long tail of data management: people that don't use traditional databases and manage data in their own unique ways. For most people managing data that way, you probably wouldn't even call it data; it's just what they use to get their job done. And so once I started to look at alternatives for managing that research data, I found Dat, was basically hooked, and started to contribute. So that's how I found Dat.\n Tobias Macey 08:16\n So that leads us nicely into talking about what the project is, and as much of the origin story as each of you might be aware of. And Joe, you already mentioned how you got involved in the project, but Danielle, if you could also share your involvement or how you got started with it as well.\n Danielle Robinson 08:33\n Yeah, I can tell the origin story. The Dat project is an open source community building a protocol for peer-to-peer data sharing. As a protocol, it's similar to HTTP and the protocols used today, but it adds extra security and automatic versioning, and it allows users to connect to a decentralized network. In a decentralized network, you can store the data anywhere, either in a cloud or on a local computer, and it works offline. So Dat is built to make it easy for developers to build decentralized applications without worrying about moving data around. The people who originally developed it, that would be Mathias, Max, and Karissa, were scratching their own itch, building software to share and archive public and research data, and this is how Joe got involved, like he was saying before. It originally started as an open source project, and then it got a grant from the Knight Foundation in 2013, a prototype grant focusing on government data. That was followed up in 2014 by a grant from the Alfred P. Sloan Foundation, which focused more on scientific research and allowed the project to put a little more effort into working with researchers. Since then, we've been working to solve research data management problems by developing software on top of the Dat protocol. The most recent project is funded by the Gordon and Betty Moore Foundation; that project started in 2016, it's called Dat in the Lab, and I can get you a link to it on our blog. It supports us to work with California Digital Library and research groups in the University of California system to make it easier to move files around, to version data sets, and to support researchers by automating archiving.
And so that's a really cool project, because we get to work directly with researchers, do the kind of participatory software design we enjoy doing, and create things that people will actually use. We also get to learn about really exciting research, very different from the research I did in my PhD; one of the labs we're working with studies sea star wasting disease. So it's really fascinating stuff, and we get to work right with them to make things that are going to fit into their workflows. I started working with Dat in the summer right before that grant was funded, so maybe six months before. I came on as a consultant initially, to help write grants and start thinking about how to work directly with researchers and what to build that would really help them move their data around and version control it. So yeah, that's how I became involved. Then in the fall I transitioned to a partnerships position, and then to the executive director position in the last month.\n Tobias Macey 11:27\n And you mentioned that a lot of the boost to the project has come in the form of grants from a few different foundations. So I'm wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project.\n Joe Hand 11:42\n Yeah, Dat really occupies a unique position in the open source world because of that grant funding. For the first few years, it was closer to a research project than a traditional product-focused startup. Other open source projects like this might be done part time as a side project, or just for fun, but the grant funding allowed the original developers to sign on and work full time, solving harder problems than they might have been able to otherwise. Since we got those grants, we've been able to toe the line between a more user-facing product and research software. The grants gave us the opportunity to walk that line, but also to get out in the field and connect with researchers and end users, so we can innovate with technical solutions but ground them in reality with specific scientific use cases. This balance is really only possible because of the grant funding, which gives us more flexibility and a somewhat longer timeline than VC money or a pure open source side project. But now we're at a critical juncture, I'd say, where grant funding is not quite enough to cover what we want to do. We're lucky that the protocol is getting to a more stable position, and we're starting to look at user-facing products on top of it and to build those around the core protocol.\n Tobias Macey 13:10\n And the fact that you have received so many different rounds of grant funding lends credence to the fact that you're solving a critical problem that lots of people are coming up against. I'm wondering if there are any other projects, companies, or organizations trying to tackle similar or related problems that you view as collaborators or competitors in the space, and where you think the Dat project is uniquely positioned to solve the specific problems it's addressing.\n Joe Hand 13:44\n Yeah, I would say there are other similar use cases and tools.
A lot of that is around sharing open data sets and the publishing of data, which Danielle might be able to talk more about. On the technical side, I guess the biggest competitor or most similar thing might be IPFS, which is another decentralized protocol for sharing and storing data in different ways. But we're actually excited to work with these various groups. IPFS is more of a storage-focused format: it basically allows content-addressed storage on a distributed network. Dat is really more about the transfer protocol and being interoperable with all these other solutions. So that's what we're more excited about: trying to understand how we can use Dat in collaboration with all these other groups.\n Danielle Robinson 14:41\n I'll just plus-one what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, I've met a lot of people trying to improve access to data broadly, and most of the people I know, really everyone in the space, takes a collaboration-not-competition sort of approach, because there are a lot of different ways to solve the problem depending on what the end user wants. There are a lot of great projects working in the space. I would agree with Joe that IPFS is the thing people sometimes ask about; I'll be at an event and someone will say, what's the difference between Dat and IPFS, and I answer pretty much how Joe just answered. But it's important to note that we know those people, we have good relationships with them, and we've actually just been emailing with them about some kind of collaboration over the next year. There are a lot of really great projects in the open data and improving-access-to-data space, and I basically support them all; there's so much work to be done that I think there's room for everyone in the space.\n Tobias Macey 15:58\n And now that you have established a nonprofit organization around Dat, are there any particular plans that you have to support future sustainability and growth for the project?\n Danielle Robinson 16:09\n Yes, future sustainability and growth for the project is what we wake up and think about every day, and sometimes in the middle of the night. That's the most important thing. Incorporating the nonprofit was a big step that happened, I think, at the end of 2016, and it's critical as we move towards a self-sustaining future. Importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I'm really excited about. For Dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. Over the next 12 months we'll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the Dat ecosystem at the same time. We're also focusing on sustainability within the project itself, and what I mean by that is governance and community management. So we are right now working with the developer community to formalize the technical process on the protocol through a working group.
And those are really great calls, with lots of great people involved. We really want to make sure that protocol decisions are made transparently, and that a wider group of the community can be involved in the process. We also want to make the path to participation, involvement, and community leadership clear for newcomers. By supporting the developer community, we hope to encourage new and exciting implementations of the Dat protocol. Some of the stuff that happened in 2017, from my perspective working in the sciences, sort of came out of nowhere: people are building amazing new social networks based on Dat, and it was really fun and exciting. Just keeping the community healthy, and making sure that the technical process and how decisions get made is really clear and transparent, is I think going to facilitate even more of that. And just another comment about being a nonprofit: because Code for Science & Society is a nonprofit, we also act as a fiscal sponsor. What that means is that like-minded projects that get grant funding but are not nonprofits themselves, and so can't accept the grant directly, can accept their grant through us. We take a small percentage of that grant, and we use it to help those projects by linking them up with our community. I work with them on grant writing, fundraising, and strategy, we support their own community engagement efforts, and sometimes we offer technical support. We see this as really important to the ecosystem and a way to help smaller projects develop and succeed. Right now we do that with two projects: one of them is called Stencila, and I can send a link for that, and the other one is called Science Fair. Stencila is an open source, reproducible documents software project funded by the Alfred P. Sloan Foundation, looking to support researchers from data collection to document authoring. And Science Fair is a peer-to-peer library built on Dat, designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share them with their colleagues. That project was funded by a prototype grant from a publisher called eLife, and they're looking for additional funding. So we're working with both of them, and in the first quarter of this year Joe and I are working to formalize the process of how we work with these other projects and what we can offer them, and hopefully we'll be in a position to take on additional projects later this year. I really enjoy that work. I went through the Mozilla Fellowship, which was a 10-month-long, crazy period where Mozilla invested a lot in me, making sure I was meeting people, learning how to write grants, learning how to give good talks, all kinds of awesome investment. For a person who goes through a program like that, or a person who has a side project, there's a need for groups in the space who can incubate those projects and help them as they develop from the incubator stage to the middle stage, before they scale up. So as a fiscal sponsor, we're hoping to be able to support projects in that space.\n Tobias Macey 20:32\n And digging into the Dat protocol itself: when I was looking through the documentation, it mentioned that the actual protocol is agnostic to the implementation, and I know that the current reference implementation is done in JavaScript.
So I'm wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, how the overall protocol has evolved since it was first started, and what your approach is to versioning the protocol itself, so that people implementing it in other technologies or formats can make sure they're compliant with specific versions of the protocol as it evolves.\n Joe Hand 21:19\n Yeah, so Dat is basically a combination of ideas from Git, BitTorrent, and just the web in general. There are a few key properties that any implementation has to recreate: content integrity, decentralized mirroring of the data sets, network privacy, incremental versioning, and random access to the data. We have a white paper that explains all of these in depth, but I'll explain how they work in a basic use case. Let's say I want to send some data to Danielle, which I do all the time, and I have a spreadsheet where I keep track of my coffee intake. I want to live-sync it to Danielle's computer so she can make sure I'm not over-caffeinating myself. Similar to how you get started with Git, I would put my spreadsheet in a folder and create a new Dat. Whenever I create a new Dat, it makes a new key pair: one public key and one private key. The public key is basically the Dat link, kind of like a URL, so you can use it in anything that speaks the Dat protocol and just open it up and look at all the files inside of that Dat. The private key allows me to write files to that Dat, and it's used to sign any new changes. So the private key allows Danielle to verify that the changes actually came from me, and that somebody else wasn't trying to fake my data or man-in-the-middle my data while I was transferring it to her. So I add my spreadsheet to the Dat, and what Dat does is break that file into little chunks, hash all of those chunks, and create a Merkle tree from them. That Merkle tree has lots of cool properties and is one of the key features of Dat. The Merkle tree allows us to sparsely replicate data: if we had a really big data set and you only want one file, we can use the Merkle tree to download just that one file and still verify the integrity of that content against an incomplete data set. The other part that allows us to do that is the register. All the files are stored in one register, and all the metadata is stored in another register, and these registers are basically append-only ledgers; they're also known as secure registers, and Google has a project called Certificate Transparency that has similar ideas. Whenever a file changes, you append that change to the metadata register, and that register stores information about the structure of the file system, what version it is, and then any other metadata, like the creation time or the change time of that file. And right now, as you said, Tobias, we're very flexible on how things are implemented, but we basically store the files as files, which allows people to see the files normally and interact with them normally. But the cool part about that is that the on-disk file storage can be really flexible.
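To make the chunking and Merkle tree idea Joe describes above concrete, here is a minimal, illustrative TypeScript (Node.js) sketch. It is not the actual Dat/Hypercore hashing scheme or tree layout; the chunk size, the pairing rule, and the sample data are assumptions made purely for illustration.

```typescript
// Illustrative only: chunk a file, hash the chunks, build a Merkle tree, and
// verify a single chunk against the root without the rest of the data.
// This is NOT the actual Dat/Hypercore tree layout or hash scheme.
import { createHash } from "crypto";

const sha256 = (data: Buffer): Buffer =>
  createHash("sha256").update(data).digest();

// Split a file into fixed-size chunks (Dat uses its own chunking strategy).
function chunk(data: Buffer, size = 4): Buffer[] {
  const out: Buffer[] = [];
  for (let i = 0; i < data.length; i += size) out.push(data.subarray(i, i + size));
  return out;
}

// Build the tree bottom-up; each level pairs up hashes from the level below.
function buildTree(chunks: Buffer[]): Buffer[][] {
  const levels: Buffer[][] = [chunks.map(sha256)];
  while (levels[levels.length - 1].length > 1) {
    const prev = levels[levels.length - 1];
    const next: Buffer[] = [];
    for (let i = 0; i < prev.length; i += 2) {
      const right = prev[i + 1] ?? prev[i]; // duplicate the last hash on odd counts
      next.push(sha256(Buffer.concat([prev[i], right])));
    }
    levels.push(next);
  }
  return levels;
}

// Verify one chunk given its index, the sibling hashes along its path, and the root.
function verifyChunk(data: Buffer, index: number, proof: Buffer[], root: Buffer): boolean {
  let hash = sha256(data);
  for (const sibling of proof) {
    hash = index % 2 === 0
      ? sha256(Buffer.concat([hash, sibling]))
      : sha256(Buffer.concat([sibling, hash]));
    index = Math.floor(index / 2);
  }
  return hash.equals(root);
}

// "Download" only chunk 2 of the coffee log and still check its integrity.
const file = Buffer.from("2018-02-01,3cups"); // 16 bytes -> 4 chunks
const chunks = chunk(file);
const tree = buildTree(chunks);
const root = tree[tree.length - 1][0];
const proof = [tree[0][3], tree[1][0]]; // sibling hashes for chunk index 2
console.log(verifyChunk(chunks[2], 2, proof, root)); // true
```

In the real protocol the writer also signs the tree with the Dat's private key, which is what lets a reader trust updates that arrive from untrusted peers.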
So as long as the implementation has random access, it can store the data in any different way. For example, we have a storage model built for the server that stores all of the files as a single file, which lets you keep fewer file descriptors open and constrains the file I/O to one file. So once my file gets added, I can share my link privately with Danielle; I can send it over chat or just paste it somewhere. Then she can clone my Dat using our command line tool, the desktop tool, or the Beaker browser. When she clones my Dat, our computers basically connect directly to each other. We use a variety of mechanisms to try to make that connection; that's been one of the challenges, which I can talk about later, how to connect peer to peer and the difficulties around that. But once we do connect, we'll transfer the data either over TCP or UDP; those are the default network protocols that we use right now, but it can be implemented on basically any other protocol. I think Mathias once said that if you could implement it over carrier pigeon, that would work fine, as long as you had a lot of pigeons. So we're really open about how the protocol information gets transferred, and we're working on a Dat-over-HTTP implementation too. That wouldn't be peer to peer, but it would allow a traditional server fallback if no peers are online, or for services that don't want to run peer to peer for whatever reason. Once Danielle clones my Dat, she can open the spreadsheet just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. Then let's say I drink another cup of coffee and update my spreadsheet: the changes will automatically be synced to her as long as she's still connected to me, and they'll be synced throughout the network to anybody else who's connected to me. The metadata register stores the updated file information, and the content register stores just the changed file blocks, so Danielle only has to sync the diff of that content change rather than the whole dataset again. This is really useful for big data sets. And yeah, we've had to design each of these pieces to be as modular as possible, both within our JavaScript implementation and in the protocol in general. So right now developers can swap out network protocols and data storage. For example, if you want to use Dat in the browser, you can use WebRTC for the networking and discovery and then use IndexedDB for data storage; IndexedDB has random access, so you can plug it in directly, and we have modules for those, which should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are more around large file sharing; it might still be okay for chat and other more text-based things. So yeah, all of our implementations are in Node right now.\n I think that was both for usability and developer friendliness, and also for being able to work in the browser and across platforms. We can distribute a binary of Dat pretty easily, and you can run it in the browser or build Dat tools on Electron.
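As a rough sketch of the "anything with random access will do" point above, here is a small TypeScript example of a swappable byte-range storage interface with an in-memory backend and a single-file backend. The interface and class names are invented for this illustration and are not the API of the storage modules the Dat implementation actually uses.

```typescript
// Sketch of a minimal random-access storage abstraction: read and write byte
// ranges at arbitrary offsets. Any backend that can do this could hold a Dat's
// chunked, versioned data. Names here are illustrative only.
import { promises as fs } from "fs";

interface RandomAccess {
  write(offset: number, data: Buffer): Promise<void>;
  read(offset: number, length: number): Promise<Buffer>;
}

// In-memory backend, e.g. for tests or browser-like environments.
class MemoryStorage implements RandomAccess {
  private buf = Buffer.alloc(0);
  async write(offset: number, data: Buffer): Promise<void> {
    const needed = offset + data.length;
    if (needed > this.buf.length) {
      this.buf = Buffer.concat([this.buf, Buffer.alloc(needed - this.buf.length)]);
    }
    data.copy(this.buf, offset);
  }
  async read(offset: number, length: number): Promise<Buffer> {
    return Buffer.from(this.buf.subarray(offset, offset + length));
  }
}

// Single-file backend, like the server-side model that keeps file I/O in one file.
class SingleFileStorage implements RandomAccess {
  constructor(private path: string) {}
  private async open(forWrite: boolean) {
    if (!forWrite) return fs.open(this.path, "r");
    try { return await fs.open(this.path, "r+"); }   // existing file
    catch { return fs.open(this.path, "w+"); }       // create on first write
  }
  async write(offset: number, data: Buffer): Promise<void> {
    const fh = await this.open(true);
    try { await fh.write(data, 0, data.length, offset); } finally { await fh.close(); }
  }
  async read(offset: number, length: number): Promise<Buffer> {
    const fh = await this.open(false);
    try {
      const out = Buffer.alloc(length);
      await fh.read(out, 0, length, offset);
      return out;
    } finally { await fh.close(); }
  }
}

// The rest of the stack doesn't care which backend it gets.
async function demo(storage: RandomAccess): Promise<void> {
  await storage.write(0, Buffer.from("chunk-0"));
  await storage.write(7, Buffer.from("chunk-1"));
  console.log((await storage.read(0, 14)).toString()); // "chunk-0chunk-1"
}

demo(new MemoryStorage()).catch(console.error);
```

Because the rest of the stack only ever asks for byte ranges, a browser implementation could satisfy the same interface with IndexedDB, which is the swap Joe mentions.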
So it allows a wide range of developer tools to be built on top of Dat. We also have a few community members now working on other implementations; Rust and C, I think, are the two that are going right now. As far as protocol versioning, that was actually one of the big conversations we were having in the last working group meeting, and it's still to be decided. Through the stages we've gone through, we've broken the protocol quite a few times, and now we're finally in a place where we want to make sure not to break it moving forward. There's space in the protocol for information like version history or the version of the protocol, so we'll probably use that to signal the version, and we need to figure out how the tools that implement it can fall back to the latest version. Before all the file-based stuff, Dat went through a few different stages: it started as more of a versioned, decentralized database, and then as Max and Mathias and Karissa moved to the scientific use cases, they removed more and more of the database architecture as the project matured. That transition was really driven by user feedback and by watching researchers work. We realized that so much of research data is still kept in files and moved manually between machines, so even if we built a special database, a lot of researchers still wouldn't be able to use it, because that requires more infrastructure than they have time to support. So we just kept working to build a general-purpose solution that allows other people to build tools to solve those more specific problems. The last point is that right now, all Dat transfer is basically one way, so only one person can update the source. This is really useful for a lot of our research use cases, where data is coming from lab equipment, there's a specific source, and you just want to disseminate that information to various computers. But it really doesn't work for collaboration. That's the next thing that we're working on, but we really want to make sure to solve this one-way problem before we move to the harder problem of collaborative data sets. That last major iteration is the hardest, and it's what we're working on right now: it allows multiple users to write to the same Dat, and with that we get into problems like conflict resolution, duplicate updates, and other harder distributed computing problems.
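The single-writer model rests on the key pair created with each Dat: the holder of the private key signs whatever is appended, and anyone holding the public key (the Dat link) can verify entries but cannot forge new ones. Below is a hedged TypeScript sketch of that idea using Node's built-in Ed25519 support; the entry format is invented for illustration, and the real protocol signs Merkle tree roots rather than raw entries.

```typescript
// Illustrative single-writer, append-only register: the writer signs each entry
// with the private key; readers verify with the public key (the Dat link).
import { generateKeyPairSync, sign, verify } from "crypto";

const { publicKey, privateKey } = generateKeyPairSync("ed25519");

type Entry = { seq: number; data: Buffer; signature: Buffer };
const register: Entry[] = [];

// Only the writer (private key holder) can append.
function append(data: Buffer): Entry {
  const seq = register.length;
  const payload = Buffer.concat([Buffer.from(String(seq)), data]);
  const signature = sign(null, payload, privateKey); // Ed25519 takes no digest name
  const entry = { seq, data, signature };
  register.push(entry);
  return entry;
}

// Any reader with the public key can verify an entry received from a peer.
function verifyEntry(entry: Entry): boolean {
  const payload = Buffer.concat([Buffer.from(String(entry.seq)), entry.data]);
  return verify(null, payload, publicKey, entry.signature);
}

append(Buffer.from("2018-02-01: 3 cups"));
append(Buffer.from("2018-02-02: 4 cups"));
console.log(register.every(verifyEntry)); // true
// Tampering (or a second writer without the key) fails verification:
register[1].data = Buffer.from("2018-02-02: 0 cups");
console.log(verifyEntry(register[1])); // false
```

The multi-writer work Joe mentions would presumably need a way to merge several logs like this one, which is where conflict resolution enters the picture.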
\n Tobias Macey 30:24\n And that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there's only one source that's allowed to update the information, then that solves a lot of the problems that might arise when syncing all these data sets between multiple machines, because there aren't going to be multiple parties changing the data concurrently, so you don't have to worry about how to handle those cases. And another question I had from what you were talking about is on the cryptography aspect: it sounds as though when you initialize the Dat, it just automatically generates the public and private key pair, and so that private key is cryptographically linked with that particular data set. But is there any way to use, for instance, Keybase or GPG to sign the source Dat in addition to the generated key, to establish your identity for when you're trying to share that information publicly, and not necessarily via some channel that already has established trust?\n Joe Hand 31:27\n Yeah, you could do that within the Dat, but we don't really have any mechanism for doing it on top of Dat, so right now we're sort of throwing that into userland. But that's a good question, and we've had some people experimenting with different identity systems and how to solve that problem. I think we're pretty excited about the Wire app, because it's open source, it uses end-to-end encryption, and it has an identity system, and we're trying to see if we can build on top of Wire. That's one of the things that we're experimenting with.\n Tobias Macey 32:09\n And one of the primary use cases mentioned in the documentation and on the website for Dat is being able to host and distribute open data sets, with a focus on researchers and academic use cases. So I'm wondering if you can talk some more about how Dat helps with that particular effort, and what improvements it offers over some of the existing solutions that researchers were using before.\n Danielle Robinson 32:33\n There are solutions for both hosting and distributing data. In terms of hosting and distribution, there's a lot of great work focused on data publication and making sure that data associated with publications is available online; I'm thinking about Zenodo and Dryad or Dataverse. There are also other data hosting platforms such as CKAN or data.world. We really love the work these people do, and we've collaborated with some of them, and we're involved in friendly open scholarship organizations alongside people from eLife and Dryad, so it's nice to work with them, and we'd love to work with them to use Dat to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned, for example if there's a large, live-updating data set, there really aren't great solutions that address data versioning and sharing. In terms of sharing and transferring, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software such as Globus or even Dropbox can require more IT infrastructure than a small research group may have; researchers are all operating on limited grant funding, and they also depend on the IT structure of their institution to get access to certain things. So a researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours, or wait overnight, to move it to another location. The ideal situation from a data management perspective is that the raw data are automatically archived to the web server and sent to the researcher's computer for processing, so you have an archived copy of the raw data that came off of the equipment. And as files are processed, they also need to be archived, so in this case you need archives of the imaging files at each step in processing.
And then when a publication is ready, in order for the data processing pipeline to be fully reproducible, you'll need the code and you'll need the data at different stages, and even without access to the compute cluster where the analysis was done, a person should be able to repeat it. And I say ideally, because this isn't really how it's happening now.\n Some of the things that stop that from happening are the cost and availability of storage, and researcher habits. I definitely know some researchers who kept data on hard drives in Tupperware to protect them in case the sprinklers ever went off, which isn't really a long-term solution, true facts. So Dat can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I'm interested in anything that makes better data management automatic for researchers. We're also interested in versioned compute environments, to help labs avoid the drawer-full-of-Jaz-drives problem, which is sadly a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access: she has the drawer, she has the Jaz drives, she can't get into them, and that data is essentially lost. Researchers are really motivated to make sure that when things are archived, they're archived in a form where they can actually be accessed, but because researchers are so busy, it's really hard to know when that is. So because we're focused on essentially filling in the gaps between the services that researchers already use and that work well for them, and on automating things, I think Dat is in a really good position to solve some of these problems. Among the researchers we're working with now, I'm thinking of one person who has a large data set and a bioinformatics pipeline; he's at a UC lab, and he wants to get all that information to his collaborator in Washington State. It's taken months, and he has not been able to do it; he just can't move that data across institutional lines. That's a much longer conversation as to why exactly that isn't working, but we're working with him to make it possible for him to move the data and to create a versioned emulation of his compute environment, so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and stuff. So yeah, hopefully that answers the question.\n Tobias Macey 37:39\n And one of the other difficult aspects of building a peer-to-peer protocol is the fact that, in order for there to be sufficient value in the protocol itself, there needs to be a network of people behind it to share that information with, and to share the bandwidth requirements for distributing it. So I'm wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort.\n Joe Hand 38:08\n Yeah, I'm not sure we really view Dat as a traditional peer-to-peer protocol in the sense of relying on network effects to scale. As Danielle said, we're just trying to get data from A to B, so our critical mass is basically two users on a given data set.
So obviously we want to first build something that offers better tools for those two users than the traditional cloud or client-server model. If I'm transferring files to another researcher using Dropbox, we have to transfer files via a third party and a third computer before they can get to the other computer. So rather than going direct between two computers, we have to take a detour, and that has implications for speed, but also for security, bandwidth usage, and even something like energy usage. By cutting out that third computer, we feel like we're already adding value to the network. We're hoping that researchers who are doing these transfers today can see the value of going direct, and of using something that is versioned and can be live-synced, over existing tools like rsync or FTP, or the commercial services that might store data in the cloud. And we really don't have anything against centralized services; we recognize that they're very useful sometimes, but they aren't the answer to everything. Depending on the use case, a decentralized system might make more sense than a centralized one, and we want to offer developers and users the option to make that choice, which we don't really have right now. But in order to do that, we really have to start with peer-to-peer tools first. Once we have that decentralized network, we can limit the network to one server peer and many clients, and then all of a sudden it's centralized. So we understand that it's easy to go from decentralized to centralized, but it's harder to go the other way around; we have to start with a peer-to-peer network in order to solve all these different problems. The other thing is that we know file systems are not going away, we know that web browsers will continue to support static files, and we also know that people will want to move these things between computers, back them up, archive them, and share them to different computers. So we know files are going to be transferred a lot in the future, and that's something we can depend on. People probably even want to do this in a secure way sometimes, and maybe in an offline environment or on a local network. So we're trying to build from those basic principles, using peer-to-peer transfer as the bedrock of all of it, and that's how we got to where we are now with the peer-to-peer network. But we're not really worried that we need a certain critical mass of users to add value, because we feel that by building the right tools with these principles, we can start adding value whether it's a decentralized network or a centralized one.\n Tobias Macey 40:59\n And one of the other use cases that's been built on top of Dat is being able to build websites and applications that can be viewed in web browsers and distributed peer to peer in that manner. So I'm wondering how much uptake you've seen in usage for that particular application of the protocol,
and how much development effort is being focused on that particular use case?\n Joe Hand 41:20\n Yeah, so if I open my Beaker browser right now, which is the main web implementation we have, the one that Paul Frazee and Tara Vancil are working on, I think I usually have 50 to 100, or sometimes 200, peers that I connect to right away. That's through some of the social network apps, like Rotonde and Fritter, and then some personal sites. We've been working with the Beaker browser folks for probably two years now, co-developing the protocol and seeing what they need support for in Beaker. But it comes back to that basic principle: we recognize that a lot of websites are static files, and if we can just support static files in the best way possible, then you can browse a lot of websites. That even gives you a benefit for things that are more interactive, which we know have to be developed so they work offline too. So both Rotonde and Twitter can work offline, and then once you get back online, you can just sync the data seamlessly. That's the most exciting part about those.\n Danielle Robinson 42:29\n You mean Fritter, not Twitter.\n Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker is a lot of fun, and if you've never played around with it, I would encourage you to download it; I think it's just beakerbrowser.com. I'm not a developer by trade, but I have seriously enjoyed playing around in Beaker, and I think some of the more playful things like Fritter that have come out of it are a lot of fun and really speak to the potential of peer-to-peer networks in today's era, as people are becoming increasingly frustrated with the centralized platforms.\n Tobias Macey 43:13\n And given that the content being distributed via Dat using the browser is primarily static in nature, I'm wondering how that affects the architectural patterns that people are used to with the common three-tier architecture. You've already mentioned a couple of social network applications that have been built on top of it, but I'm wondering if there are any others that are built on top of and delivered via Dat that you're aware of and could talk about, that speak to some of the ways people are taking advantage of Dat in more of the consumer space.\n Joe Hand 43:47\n Yeah, I think one of the big shifts that has made this easier is having databases in the browser, so things like IndexedDB or other local storage databases, and then being able to sync those to other computers. As long as I know that I'm writing to my local database, I think people can build things like games on this. You could build a chess game where I write to my local database, you have some logic for determining whether a move is valid or not, and then you sync that to your competitor. It's a more constrained environment, but I think that also gives you the benefit of being able to constrain your development and not require these external services or external database calls or whatever. I know that I've tried a few times to develop projects on it, just fun little things.
And it is a challenge, because you have to think differently about how those things work, and you can't necessarily rely on external services, whether that's something as simple as loading fonts from an external service, or CSS styles, or external JavaScript; you want all of that to be packaged within one Dat if you want to ensure it's all going to work. So Dat has you thinking a little differently even about those simple things. But yeah, it does constrain the bigger applications. I think the other area where we could see development is more in Electron applications, so maybe not in Beaker, but using Electron as a platform for other types of applications that might need those more flexible models. Science Fair, which is one of our hosted projects, is a really good example of how to use Dat to distribute data but still have a full application. You can distribute all the data for the application over Dat and keep it updated through the live syncing, and users can download just the PDFs they need to read, or the journals or the figures they want, downloading whatever they want. That allows developers to have a flexible model where you can distribute things peer to peer and have both the live syncing and on-demand download of whatever data users need, with Dat providing the framework for that data management.\n Tobias Macey 46:15\n And one of the other challenges that's posed, particularly for this public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable, since they're essentially just hashes of the content. So I'm wondering if there are any particular mechanisms that you have either built, planned, or started discussing for facilitating content discovery of the information that's being distributed by these different networks.\n Joe Hand 46:50\n Yeah, this is definitely an open question. I'll fall back on my common answer, which is that it depends on the tool and the different communities, and there are going to be different approaches; some might be more decentralized, and some might be centralized. So, for example, with data set discovery, there are a lot of good centralized services for data set publishing, as Danielle mentioned, like Zenodo or Dataverse. These are places that already have discovery engines, I guess we'll say, and they publish data sets, so you could similarly publish the dat URL along with those data sets so that people have an alternative way to download them. That's one way we've been thinking about discovery: leveraging these existing solutions that are doing a really good job in their domain, and trying to work with them to start using Dat for their data management. Another solution, a hacky one I guess I'll say, is using existing domains and DNS. Basically, you can publish a regular HTTP site on your URL and give it a specific well-known file that points to your dat address, and then the Beaker browser can find that file and tell you that a peer-to-peer version of that site is available. So we're leveraging the existing DNS infrastructure to start to discover content just with existing URLs.
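The well-known-file trick Joe describes can be sketched roughly like this: a regular HTTPS site serves a small text file whose first line is a dat:// address, and a Dat-aware client checks for it to learn that a peer-to-peer version of the site exists. The file path, the expected contents, and the example domain here are assumptions for illustration rather than a guaranteed specification of Beaker's behavior.

```typescript
// Illustrative resolver: look for a well-known file on an HTTPS site that
// advertises a dat:// address for the same content. Path, format, and the
// example domain are assumptions, not a guaranteed spec.
import * as https from "https";

function fetchText(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      if (res.statusCode !== 200) {
        reject(new Error(`HTTP ${res.statusCode}`));
        return;
      }
      let body = "";
      res.on("data", (chunk) => (body += chunk));
      res.on("end", () => resolve(body));
    }).on("error", reject);
  });
}

// Resolve a human-readable domain to a dat:// key, if the site advertises one.
async function resolveDatForDomain(domain: string): Promise<string | null> {
  try {
    const body = await fetchText(`https://${domain}/.well-known/dat`);
    const first = body.split("\n")[0].trim(); // e.g. "dat://<64-hex-chars>"
    return /^dat:\/\/[0-9a-f]{64}$/.test(first) ? first : null;
  } catch {
    return null; // no well-known file: fall back to plain HTTPS
  }
}

// Usage (example domain only):
resolveDatForDomain("example.org").then((link) => {
  console.log(link ?? "no peer-to-peer version advertised");
});
```

A client that gets null back simply keeps using the plain HTTPS version of the site.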
And I think a lot of the discovery will be more community based. For example, in Fritter and Rotonde, people are starting to build crawlers and search bots to discover users or to search content. So we're basically just looking at where there is need, identifying different types of crawlers to build, and figuring out how to connect those communities in different ways. We're really excited to see what ideas pop up in that area, and they'll probably come in a decentralized way, we hope.\n Tobias Macey 48:46\n And for somebody who wants to start using Dat, what is involved in creating and/or consuming the content that's available on the network? And are there any particular resources available to get somebody up to speed and understand how it works and some of the different uses that they could put it to?\n Danielle Robinson 49:05\n Sure, I can take that, and Joe can chime in if he thinks of anything else. We built a tutorial for our work with the labs and for MozFest this year; that's at try-dat.com. The tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug; there may be bugs, fair warning, but it was working pretty well when I used it last, and it runs in the browser. It spins up a little virtual machine, so you can either share data with yourself, or you can do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started: you can visit pages over Dat just like you would a normal web page. For example, you can go to a website, and we'll give Tobias the link for the show notes, and just change the http to dat, so it looks like dat://jhand.space. Beaker also has this fun thing that lets you create a new site with a single click, and you can fork sites and edit them and make your own copies of things, which is fun if you're learning about how to build websites. So you can go to beakerbrowser.com and learn about that. I think we've already talked about Rotonde and Fritter, and we'll add links for people who want to learn more about those. And then for data-focused users, you can use Dat for sharing or transferring files, either with the desktop application or the command line interface. If you're interested, we encourage you to play around; the community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask. We love talking to new people, because that's how all the exciting stuff happens in this community.\n Tobias Macey 50:58\n And what have been some of the most challenging aspects of building the project and the community, and promoting the use cases and capabilities of the project?\n Danielle Robinson 51:10\n I can speak a little bit to promoting it in academic research. In academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons. There's tension between what your boss wants, what the IT department has approved, the institutional data security needs, and then the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure, but it is a lot of promotion and outreach to get scientists to try a new workflow.
They're really busy, and the incentives are all, you know, get more grants, do more projects, publish more papers. So even if something will eventually make your life easier, it's hard to sink in the time up front. One thing I've noticed, and this is probably common to all industries, is that I'll be talking to someone and they'll say, oh, archiving the data from my research group is not a problem for me, and then they'll proceed to describe a super problematic data management workflow. It's not a problem for them anymore because they're used to it, so it doesn't hurt day to day. But doing things like waiting until the point of publication and then trying to go back and archive all the raw data, when maybe some of it was collected by a postdoc who's now gone, and some was collected by a summer student who used a non-standard naming scheme for all the files, you know, there are just a million ways that stuff can go wrong. So for now, we're focusing on developing real-world use cases and participating in community education around data management. We want to build stuff that's meaningful for researchers and others who work with data, and we think that working with people, and doing the nonprofit thing with grants, is going to be the way to get us there. Joe, do you want to talk a little bit about building it?\n Joe Hand 53:03\n Yeah, sure. In terms of building it, I haven't done too much work on the core protocol, so I can't say much about the difficult design decisions there. I'm the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems. So as Danielle said, it's as much about people as it is about software. But one of the most challenging things that we've run into a lot is basically network issues. In a peer-to-peer network, you have to figure out how to connect to peers directly on networks that might not want you to do that. I think a lot of that comes from BitTorrent having led different institutions to restrict peer-to-peer networking in different ways, so we're having to fight that battle against these existing restrictions, trying to find out how these networks are restricted and how we can continue to have success in connecting peers directly rather than through a third-party server. And it's funny, or maybe not funny, but some of the strictest networks we've found are actually at academic institutions. For example, at one of the UC campuses, I think we found out that computers can never connect directly to other computers on that same network. So if we wanted to transfer data between two computers sitting right next to each other, we'd basically have to go through an external cloud server just to get it to the computer sitting right next to it, or, you know, use a hard drive or a thumb drive or whatever. All these different network configurations, I think, are one of the hardest parts, both in terms of implementation, but also in terms of testing, since we can't readily get into these UC campuses to see what the network setup is.
So we're trying to create more tools around networking, both testing networks in the wild and using virtual networks to test different types of network setups, and leveraging those two things combined to try to get around all these network connection issues. So yeah, I would love to have Mathias answer this question around the design decisions in the core protocol, but I can't really say much about that, unfortunately.\n Tobias Macey 55:29\n And are there any particularly interesting or inspiring uses of Dat that you're aware of that you'd like to share?\n Danielle Robinson 55:36\n Sure, I can share a couple of things that we were involved in. Last January, we were involved in the data rescue and Libraries+ Network community, the movement to archive government-funded research at trusted public institutions like libraries and archives. As a part of that, we got to work with some of the really awesome people at California Digital Library. California Digital Library is really cool because it is a digital library with a mandate to preserve, archive, and steward the data that's produced in the UC system, and it supports the entire UC system, and the people are great. So we worked with them to make the first-ever backup of data.gov, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. That was a really cool project, and it produced a useful thing, and we got to work with some of the data.gov people to make it happen; they were like, really, it has never been backed up? So it was a good time to do it. But believe it or not, it's actually pretty hard to find funding for that work, and we have more work we'd like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long-term preservation of the research that gets done in this country, so hopefully 2018 will see those projects funded or new collaborations in that space. It's also a fantastic community, with a lot of really interesting librarians and archivists who have great perspective on long-term data preservation, and I love working with them, so hopefully we can do something else there. The other thing that I'm really excited about is the Dat in the Lab project, where we're working on the container question. I don't want to run over time, so I don't know how much I should go into this, but we've learned a lot about really interesting research. We're working to develop a container-based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly get the analysis pipelines they're working on usable in other locations. And believe it or not, this is a big problem. I was sort of surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA, and you could drive back and forth between Merced and UCLA a bunch of times in four months. But it's this little stuff that really slows research down, so I'm really excited about the potential there.
And we've written a couple of blog posts on that, so I can add the links to those in the follow-up.\n Joe Hand 58:36\n And I'd say the most novel use that I'm excited about is called Hypervision, which is basically video streaming built on Dat. Mathias Buus, one of the lead developers on Dat, is prototyping something similar with Danish public TV; they basically want to live-stream their channels over the peer-to-peer network. I'm excited about that because I'd really love to get more public television and public radio distributing content peer to peer, so we can reduce their infrastructure costs and hopefully allow more of that great content to come out.\n Tobias Macey 59:09\n Are there any other topics that we didn't discuss yet that you think we should talk about before we close out the show?\n Danielle Robinson 59:15\n Um, I think I'm feeling pretty good. What about you, Joe?\n Joe Hand 59:18\n Yeah, I think that's it for me.\n Tobias Macey 59:20\n Okay. So for anybody who wants to keep up to date with the work you're doing or get in touch, we'll have you each add your preferred contact information to the show notes. And as a final question, to give the listeners something else to think about: from your perspective, what is the biggest gap in the tooling or technology that's available for data management today?\n Joe Hand 59:42\n I'd say transferring files, which feels really funny to say, but to me it's still a problem that's not really well solved. Just how do you get files from A to B in a consistent and easy-to-use manner, especially with a solution that doesn't require a command line, is still secure, and hopefully doesn't go through a third-party service, because hopefully that means it works offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that's one of the biggest gaps that we don't really address yet. There are a lot of great data management tools out there, but I think they're aimed more at data scientists or software-focused users who might use managed databases or something like Hadoop. But there's really a ton of users out there who don't have tools. Most of the world is still offline or has inconsistent internet, and putting everything through servers in the cloud isn't really feasible, but the alternatives now require careful, manual data management if you don't want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.\n Danielle Robinson 01:00:48\n Plus one to what Joe said: transferring files. It does feel funny to say that, but it is still a problem in a lot of industries, and especially where I come from in research science. And from my perspective, I guess the other issue is that the people problems are always as hard as or harder than the technical problems. If people don't think that it's important to share data or archive data in an accessible and usable form, we could have the world's best, easiest-to-use tool, and it wouldn't impact the landscape or the accessibility of data.
And similarly, if people are sharing data that's not usable, because it's missing experimental context, or it's in a proprietary format, or because it's shared under a restrictive license, it's also not going to impact the landscape or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded, and that data is shared in a usable form. That's really key. And I'll add a link in the show notes to the FAIR data principles, which say that data should be findable, accessible, interoperable, and reusable; that's something your listeners might want to check out if they're not familiar with it. It's a framework developed in academia, but I'm not sure how much impact it's had outside of that sphere, so it would be interesting to talk to your listeners a little bit about that. And yeah, I'll put my contact info in the show notes, and I'd love to connect with anyone and answer any further questions about Dat and what we're going to try to do with Code for Science & Society over the next year. So thanks a lot, Tobias, for inviting us.\n Tobias Macey 01:02:30\n Yeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you're doing. It's definitely a very interesting project with a lot of useful potential, and I'm excited to see where you go from now into the future. So thank you both for your time, and I hope you enjoy the rest of your evening.\n Unknown Speaker 01:02:48\n Thank you. Thank you.\n Transcribed by https://otter.ai\n \n\n\n","content_html":"

Summary

\n

Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators, the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.

\n

Preamble

\n\n

Interview

\n\n

Contact Info

\n\n

Parting Question

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n\n
\n Click here to read the unedited transcript…\n

Tobias Macey 00:13

\n

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to launch your next project, you’ll need somewhere to deploy it, so you should check out Linode. Go to dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly Strata conference in San Jose, California, happening from March 5th to the 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O’Reilly AI conference is also coming up, happening April 29th to the 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey, and today I’m interviewing Danielle Robinson and Joe Hand about the Dat project, the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure.

\n

Danielle Robinson 02:10

\n

My name is Danielle Robinson, and I’m the co-executive director of Code for Science and Society, which is the nonprofit that supports the Dat project. I’ve been working on Dat-related projects, first as partnerships director, for about a year now. And I’m here with my colleague, Joe Hand. Take it away, Joe.

\n

Joe Hand 02:32

\n

I’m Joe Hand, and I’m the other co-executive director and the director of operations at Code for Science and Society. And I’ve been a core contributor for about two years now.

\n

Tobias Macey 02:42

\n

And Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure.

\n

Danielle Robinson 02:48

\n

So I have a PhD in neuroscience. I finished that about a year and a half ago. During my PhD, my research was focused on cell biology. Without getting into the weeds too much, I spent a lot of time on microscopes collecting medium sized imaging data. And during that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting people’s access to the results of taxpayer funded research. Publications are behind paywalls, and data is either not published along with the paper or is sometimes published but not well archived and becomes inaccessible over time. Compounding this, traditionally code has not really been thought of as an academic or scholarly work, which is a whole other conversation. But even though these things are changing, data and code aren’t shared consistently, and are pretty inconsistently managed within labs, I think that’s fair to say. What that does is make it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the OpenCon and Mozilla Science communities, which I encourage your listeners to check out. These communities build interdisciplinary connections between the open source world and the open education, open access, and open data communities. And that’s really important to build things that people will actually use and make big cultural and policy changes that will make it easier to access research and share data. So I got involved partly because of the technical challenge, but also because I’m interested in the people problems: the changes to the incentive structure and the culture of research that are needed to make data management better day to day and make our research infrastructure stronger and longer lasting.

\n

Tobias Macey 04:54

\n

And Joe, how did you get involved in data management?

\n

Joe Hand 04:57

\n

Yeah, I’ve sort of gone back and forth between the more academic or research data management side and the more traditional software side. I really got started in data management when I was at a data visualization agency. We basically built pretty, web based, interactive visualizations for a variety of clients. This was cool because it allowed me to see a large variety of data management techniques. There was the small scale, spreadsheets and manually updating data in spreadsheets and then sending that off to visualize, up to big Fortune 500 companies that had data warehouses and full internal APIs that we got access to. So it was really cool to see that variety of data collection and data usage between all those organizations. That was also good because it helped me understand how to use data effectively, and that really means telling a story around it. In order to use data, you have to either use some math or some visual representation, and the best stories around data combine a bit of both of those. And then from there, I moved to a research institute, and we were tasked with building a data platform for an international NGO. That group basically does census data collection in slums all over the world. As a research group, we were interested in using that data for research, but we also had to help them figure out how to collect that data. Before we came in on that project, they’d basically been doing 30 years of data collection on paper, then sometimes manually entering that data into spreadsheets, and then trying to share that around through thumb drives or Dropbox or whatever tools they had access to. So this was cool, because it really gave me a great opportunity to see the other side of data management and analysis. We had worked with the corporate clients, which have lots of resources and compute resources and cloud servers, and this was the other side, where there are very few resources, most of the data analysis happens offline, and a lot of the data transfer happens offline. So it was really interesting to see that a lot of the tools I’d been taking for granted couldn’t be applied in those areas. And then on the research side of things, I saw that scientists and governments were just sort of haphazardly organizing data in the same way. I was trying to collect and download census data from about 30 countries, and we had to email and fax people, and we got different CDs and paper documents and PDFs in other languages. So that really illustrated that there’s a lot of data managed out there in ways that I wasn’t totally familiar with, and it’s just very crazy how everybody manages their data in a different way. That’s along what I like to call the long tail of data management: people that don’t use traditional databases or that manage data in their own unique ways. Most people managing data in that way probably wouldn’t call it data; it’s just what they use to get their job done. And so once I started to look at alternatives for managing that research data, I found Dat, and was basically hooked and started to contribute. So that’s how I found Dat.

\n

Tobias Macey 08:16

\n

So that leads us nicely into talking about what the project is, and as much of the origin story as each of you might be aware of. And Joe, you already mentioned how you got involved in the project, but Danielle, if you could also share your involvement or how you got started with it as well,

\n

Danielle Robinson 08:33

\n

Yeah, I can tell the origin story. So the Dat project is an open source community building a protocol for peer to peer data sharing. As a protocol it’s similar to HTTP in how the protocol is used today, but Dat adds extra security and automatic versioning, and allows users to connect in a decentralized network. You can store the data anywhere, either in a cloud or on a local computer, and it does work offline. Dat is built to make it easy for developers to build decentralized applications without worrying about moving data around. The people who originally developed it, that would be Mathias and Max and Karissa, were scratching their own itch for building software to share and archive public and research data, and this is how Joe got involved, like he was saying before. So it originally started as an open source project, and then it got a grant from the Knight Foundation in 2013 as a prototype grant focusing on government data. That was followed up in 2014 by a grant from the Alfred P. Sloan Foundation, and that grant focused more on scientific research and allowed the project to put a little more effort into working with researchers. Since then, we’ve been working to solve research data management problems by developing software on top of the Dat protocol. The most recent project is funded by the Gordon and Betty Moore Foundation. That project started in 2016, it’s called Dat in the Lab, and I can get you a link to it on our blog. It supports us to work with the California Digital Library and research groups in the University of California system to make it easier to move files around and version data sets, and to support researchers through automated archiving. And so that’s a really cool project, because we get to work directly with researchers and do the kind of participatory software design that we enjoy doing and create things that people will actually use. And we get to learn about really exciting research, very different from the research I did in my PhD; one of the labs we’re working with studies sea star wasting disease. So it’s really fascinating stuff, and we get to work right with them to make things that are going to fit into their workflows. I started working with Dat in the summer, right before that grant was funded, so I guess maybe six months before that grant was funded. I came on as a consultant initially to help write grants and start talking about how to work directly with researchers and what to build that will really help them move their data around and version control it. So yeah, that’s how I became involved. And then in the fall I transitioned to a partnerships position, and then to the executive director position in the last month.

\n

Tobias Macey 11:27

\n

And you mentioned that a lot of the sort of boost to the project has come in the form of grants from a few different foundations. So I’m wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project?

\n

Joe Hand 11:42

\n

Yeah, I mean, Dat really occupies a unique position in the open source world with that grant funding. For the first few years, it was closer to a research project than a traditional product focused startup, and other open source projects like that might be done part time as a side project or just for fun. But the grant funding really allowed the original developers to sign on and work full time, solving harder problems than they might be able to otherwise. Since we got those grants, we’ve been able to walk the line between a more user facing product and research software. The grants really gave us the opportunity to walk that line, but also to get out in the field and connect with researchers and end users, so we can innovate with technical solutions but really ground those in reality with specific scientific use cases. This balance is really only possible because of that grant funding, which gives us more flexibility and maybe a little longer timeline than VC money or just an open source side project. But now we’re really at a critical juncture, I’d say, where grant funding is not quite enough to cover what we want to do. But we’re lucky the protocol is getting into a more stable position, and we’re starting to look at those user facing products on top and starting to build those around the core protocol.

\n

Tobias Macey 13:10

\n

And the fact that you have received so many different rounds of grant funding sort of lends credence to the fact that you’re solving a critical problem that lots of people are coming up against. I’m wondering if there are any other projects or companies or organizations that are trying to tackle similar or related problems that you view as collaborators or competitors in the space, and where you think the Dat project is uniquely positioned to solve the specific problems that it’s addressing?

\n

Joe Hand 13:44

\n

Yeah, I mean, I would say there are other similar use cases and tools, and a lot of that is around sharing open data sets and the publishing of data, which Danielle might be able to talk more about. But on the technical side, I guess the biggest competitor or most similar thing might be IPFS, which is another decentralized protocol for sharing and storing data in different ways. But we’re actually excited to work with these various companies. IPFS is more of a storage focused format, so it basically allows content addressed storage on a distributed network, whereas Dat is really more about the transfer protocol and being very interoperable with all these other solutions. So yeah, that’s what we’re more excited about: trying to understand how we can use Dat in collaboration with all these other groups. Yeah,

\n

Danielle Robinson 14:41

\n

I’ll just plus one what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, there are a lot of people trying to improve access to data broadly, and most of the people I know in the space really take a collaboration, not competition, sort of approach, because there are a lot of different ways to solve the problem depending on what the end user wants. And there are a lot of great projects working in the space. I would agree with Joe, I guess, that IPFS is the thing people sometimes ask about; like, I’ll be at an event and someone will say, what’s the difference between Dat and IPFS, and I answer pretty much how Joe just answered. But it’s important to note that we know those people and we have good relationships with them, and we’ve actually just been emailing with them about some kind of collaboration over the next year. So there are a lot of really great projects in the open data and improving access to data space, and I basically support them all. There’s so much work to be done that I think there’s room for all the people in the space.

\n

Tobias Macey 15:58

\n

And now that you have established a nonprofit organization around Dat, are there any particular plans that you have to support future sustainability and growth for the project?

\n

Danielle Robinson 16:09

\n

Yes, future sustainability and growth for the project is what we wake up and think about every day, sometimes in the middle of the night. That’s the most important thing. Incorporating the nonprofit was a big step that happened, I think, at the end of 2016, and it’s critical as we move towards a self sustaining future. Importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I’m really excited about. For Dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. So over the next 12 months we’ll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the Dat ecosystem at the same time. We’re also focusing on sustainability within the project itself, and what I mean by that is governance and community management. We are right now working with the developer community to formalize the technical process on the protocol through a working group. Those are really great calls, and lots of great people are involved in that. We really want to make sure that protocol decisions are made transparently and that we can involve a wider group of the community in the process. We also want to make the path to participation, involvement, and community leadership clear for newcomers. By supporting the developer community, we hope to encourage new and exciting implementations of the Dat protocol. Some of the stuff that happened in 2017, from my perspective working in the sciences, sort of came out of nowhere, and people are building amazing new social networks based on Dat. It was really fun and exciting. So just keeping the community healthy, and making sure that the technical process and how decisions get made is really clear and transparent, I think is going to facilitate even more of that. And just another comment about being a nonprofit: because Code for Science and Society is a nonprofit, we also act as a fiscal sponsor. What that means is that like minded projects that get grant funding but are not nonprofits, so they can’t accept the grant on their own, can run their grant through us. We take a small percentage of that grant, and we use that to help those projects by linking them up with our community. We work with them on grant writing, fundraising, and strategy, we support their own community engagement efforts, and we sometimes offer technical support. We see this as really important to the ecosystem and a way to help smaller projects develop and succeed. Right now we do that with two projects. One of them is called Stencila, and I can send a link for that, and the other one is called Science Fair. Stencila is an open source reproducible documents project funded by the Alfred P. Sloan Foundation; it’s looking to support researchers from data collection to document authoring. And Science Fair is a peer to peer library built on Dat, which is designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share them with their colleagues. That project was funded by a prototype grant from the publisher eLife, and they’re looking for additional funding. So we’re working with both of them.
And in the first quarter of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them, and hopefully we’ll be in a position to take on additional projects later this year. I really enjoy that work. I went through the Mozilla Fellowship, which was a 10 month long, crazy period where Mozilla invested a lot in me, making sure I was meeting people and learning how to write grants and learning how to give good talks, all kinds of awesome investment. And so for a person who goes through a program like that, or a person who has a side project, there’s a need for groups in the space who can incubate those projects and help them as they develop from the incubator stage to the middle stage, before they scale up. So as a fiscal sponsor, we’re hoping to be able to support projects in that space.

\n

Tobias Macey 20:32

\n

And digging into the Dat protocol itself, when I was looking through the documentation, it mentioned that the actual protocol is agnostic to the implementation, and I know that the current reference implementation is done in JavaScript. So I’m wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, and how the overall protocol has evolved since it was first started. And what is your approach to versioning the protocol itself, to ensure that people who are implementing it in other technologies or formats are able to ensure that they’re compliant with specific versions of the protocol as it evolves?

\n

Joe Hand 21:19

\n

Yeah, so Dat is basically a combination of ideas from Git, BitTorrent, and just the web in general. There are a few key properties in Dat that basically any implementation has to recreate, and those are content integrity, decentralized mirroring of the data sets, network privacy, incremental versioning, and then random access to the data. We have a white paper that explains all of these in depth, but I’ll explain how they work in a basic use case. Let’s say I want to send some data to Danielle, which I do all the time, and I have a spreadsheet where I keep track of my coffee intake. I want it to live on Danielle’s computer so she can make sure I’m not over-caffeinating myself. So, similar to how you get started with Git, I would put my spreadsheet in a folder and create a new dat. Whenever I create a new dat, it makes a new key pair: one public key and one private key. The public key is basically the dat link, kind of like a URL, so you can use that in anything that speaks the Dat protocol, and you can just open that up and look at all the files inside of that dat. The private key allows me to write files to the dat, and it’s used to sign any of the new changes. Those signatures allow Danielle to verify that the changes actually came from me, and that somebody else wasn’t trying to fake my data, or trying to man-in-the-middle my data when I was transferring it to her. So I add my spreadsheet to the dat, and what Dat does is break that file into little chunks. It hashes all those chunks and creates a Merkle tree with them, and that Merkle tree has lots of cool properties and is one of the key features of Dat. The Merkle tree allows us to sparsely replicate data, so if we had a really big data set and you only want one file, we can use the Merkle tree to download one file and then still verify the integrity of that content with that incomplete data set. The other part that allows us to do that is the register. All the files are stored in one register, and all the metadata is stored in another register, and these registers are basically append-only ledgers. They’re also sometimes known as secure registers; Google has a project called Certificate Transparency that has similar ideas. Whenever a file changes, you might append that to the metadata register, and that register stores basic information about the structure of the file system, what version it is, and then any other metadata, like the creation time or the change time of that file. And so right now, as you said, Tobias, we are very flexible on how things are implemented, but right now we basically store the files as files. That allows people to see the files normally and interact with them normally. But the cool part is that the on-disk file storage can be really flexible: as long as the implementation has random access, basically, then it can store the data in any different way. We have, for example, a storage model built for servers that stores all of the files as a single file. That allows you to have fewer file descriptors open and gets the file I/O all constrained to one file.
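To make the content integrity and sparse replication ideas concrete, here is a minimal, illustrative sketch of hashing chunks into a Merkle tree and verifying a single chunk against the root. This is a simplified example for this writeup, not Dat's actual hashing scheme, tree layout, or wire format (Dat's signed flat trees are described in its white paper); the chunk size and helper names are invented for illustration.

```typescript
// Toy Merkle-tree content integrity: a reader holding only one chunk plus a
// proof can check it against a trusted root hash, without the rest of the data.
import { createHash } from "crypto";

const sha256 = (data: Buffer): Buffer =>
  createHash("sha256").update(data).digest();

// Split a file's bytes into fixed-size chunks (chunk size is arbitrary here).
function chunkify(data: Buffer, chunkSize = 8): Buffer[] {
  const chunks: Buffer[] = [];
  for (let i = 0; i < data.length; i += chunkSize) {
    chunks.push(data.subarray(i, i + chunkSize));
  }
  return chunks;
}

// Build the tree bottom-up; each level pairs and hashes the level below.
function buildTree(leaves: Buffer[]): Buffer[][] {
  const levels: Buffer[][] = [leaves.map(sha256)];
  while (levels[levels.length - 1].length > 1) {
    const prev = levels[levels.length - 1];
    const next: Buffer[] = [];
    for (let i = 0; i < prev.length; i += 2) {
      const right = prev[i + 1] ?? prev[i]; // duplicate a dangling node
      next.push(sha256(Buffer.concat([prev[i], right])));
    }
    levels.push(next);
  }
  return levels;
}

// Proof for one chunk: the sibling hash at every level up to the root.
function proofFor(levels: Buffer[][], index: number): Buffer[] {
  const proof: Buffer[] = [];
  for (let level = 0; level < levels.length - 1; level++) {
    const nodes = levels[level];
    proof.push(nodes[index ^ 1] ?? nodes[index]);
    index = Math.floor(index / 2);
  }
  return proof;
}

// Recompute the root from one chunk and its proof, and compare.
function verifyChunk(chunk: Buffer, index: number, proof: Buffer[], root: Buffer): boolean {
  let hash = sha256(chunk);
  for (const sibling of proof) {
    hash = index % 2 === 0
      ? sha256(Buffer.concat([hash, sibling]))
      : sha256(Buffer.concat([sibling, hash]));
    index = Math.floor(index / 2);
  }
  return hash.equals(root);
}

// Usage: "download" only chunk 2 of the coffee log and still verify it.
const file = Buffer.from("2018-02-01,3 cups\n2018-02-02,5 cups\n");
const chunks = chunkify(file);
const levels = buildTree(chunks);
const root = levels[levels.length - 1][0];
console.log(verifyChunk(chunks[2], 2, proofFor(levels, 2), root)); // true
```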
So once my file gets added, I can share my link privately with Danielle; I can send that over chat or something, or just paste it somewhere. Then she can clone my dat using our command line tool or the desktop tool or the Beaker browser. When she clones my dat, our computers basically connect directly to each other. We use a variety of mechanisms to try and make that connection; that’s been one of the challenges that I can talk about later, how to connect peer to peer and the challenges around that. But once we do connect, we’ll transfer the data either over TCP or UDP, so those are the default network protocols that we use right now. But that can basically be swapped for any other protocol. I think Mathias once said that if you could implement it over carrier pigeon, that would work fine, as long as you had a lot of pigeons. So we’re really open to how the data gets transferred, as far as the protocol is concerned. We’re working on a dat-over-HTTP implementation too. That wouldn’t be peer to peer, but it would allow a traditional server fallback if no peers are online, or for services that don’t want to run peer to peer for whatever reason. Once Danielle clones my dat, she can open it just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. Then let’s say I drink another cup of coffee and update my spreadsheet: the changes will basically automatically be synced to her, as long as she’s still connected to me, and they will be synced throughout the network to anybody else that’s connected to me. The metadata register stores that updated file information, and the content register stores just the changed file blocks, so Danielle only has to sync the diff of that content change rather than the whole dataset again. This is really useful for big data sets, rather than re-syncing the whole thing. And yeah, we’ve had to design each of these pieces to be as modular as possible, both within our JavaScript implementation but also in the protocol in general. So right now, developers can swap in other network protocols and data storage. For example, if you want to use Dat in the browser, you can use WebRTC for the networking and discovery and then use IndexedDB for data storage. IndexedDB has random access, so you can just plug that directly into Dat, and we have some modules for those, and that should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are more around large file sharing. It still might be okay for chat and other more text based things. So yeah, all of our implementation is in Node right now.
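The incremental sync Joe describes, where a reader who is already partly up to date pulls only the new entries of an append-only register, can be sketched roughly like this. This is a toy model for illustration only, not Dat's replication protocol; the Entry, Register, and Reader names are invented for the example.

```typescript
// Toy append-only register with delta sync: the reader remembers the last
// version it has and asks the source only for newer entries.

interface Entry {
  version: number;   // position in the append-only log
  path: string;      // which file changed
  block: string;     // the changed content block (real Dat stores chunks)
}

class Register {
  private log: Entry[] = [];

  append(path: string, block: string): Entry {
    const entry = { version: this.log.length, path, block };
    this.log.push(entry);
    return entry;
  }

  // A peer that already has everything before `sinceVersion` asks only for the rest.
  entriesSince(sinceVersion: number): Entry[] {
    return this.log.slice(sinceVersion);
  }
}

class Reader {
  private have = 0;
  private files = new Map<string, string>();

  syncFrom(source: Register): void {
    for (const entry of source.entriesSince(this.have)) {
      this.files.set(entry.path, entry.block); // apply only the diff
      this.have = entry.version + 1;
    }
  }

  read(path: string): string | undefined {
    return this.files.get(path);
  }
}

// Usage: the source appends a new coffee entry; the reader pulls only that delta.
const source = new Register();
const danielle = new Reader();
source.append("coffee.csv", "2018-02-01,3 cups");
danielle.syncFrom(source);                        // first sync: one entry
source.append("coffee.csv", "2018-02-01,4 cups");
danielle.syncFrom(source);                        // second sync: only the new entry
console.log(danielle.read("coffee.csv"));         // "2018-02-01,4 cups"
```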

\n

I think that was both for usability and developer friendliness, and also for just being able to work in the browser and across platforms. We can distribute a binary of Dat pretty easily now, and you can run it in the browser or build Dat tools on Electron, so it allows a wide range of developer tools to be built on top of Dat. But we have a few community members now working on different implementations; Rust and C, I think, are the two that are going right now. And as far as protocol versioning, that was actually one of the big conversations we were having in the last working group meeting, and that’s to be decided, basically. Through the stages we’ve gone through, we’ve broken it quite a few times, and now we’re finally in a place where we want to make sure not to break it moving forward. There’s space in the protocol for information like version history, or the version of the protocol, so we’ll probably use that to signal the version and just figure out how the tools that are implementing it can fall back to the latest version. Before all the file based stuff, Dat went through a few different stages. It started really as more of a versioned, decentralized database, and then as Max and Mathias and Karissa moved to the scientific use cases, they removed more and more of the database architecture as it moved on and matured. That transition was really driven by user feedback and watching researchers work. We realized that so much of research data is still kept in files and basically moved manually between machines, so even if we were going to build a special database, a lot of researchers still wouldn’t be able to use it, because that requires more infrastructure than they have time to support. So we just kept working to build a general purpose solution that allows other people to build tools to solve those more specific problems. And the last point is that right now, all Dat transfer is basically one way, so only one person can update the source. This is really useful for a lot of our research use cases where they’re getting data from lab equipment, where there’s a specific source and you just want to disseminate that information to various computers. But it really doesn’t work for collaboration, so that’s the next thing that we’re working on. We really want to make sure to solve this one way problem before we move to the harder problem of collaborative data sets. This last major iteration is sort of the hardest, and that’s what we’re working on right now: allowing multiple users to write to the same dat. With that, we get into problems like conflict resolution and duplicate updates and other harder distributed computing problems.
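The single-writer model, where only the holder of the private key can append and anyone holding the public key (which doubles as the dat link) can verify updates, can be illustrated with a small sketch. This is my own example using Node's built-in Ed25519 signing, not Dat's actual data structures or signature format.

```typescript
// Only the holder of the private key can produce valid appends; anyone with the
// public key (the shared address) can verify them. Illustration only.
import { generateKeyPairSync, sign, verify } from "crypto";

// Creating a new dat-like archive generates a key pair; the public key is the address.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// The writer signs every entry it appends to its register.
function signedAppend(entry: string): { entry: string; signature: Buffer } {
  return { entry, signature: sign(null, Buffer.from(entry), privateKey) };
}

// A reader who only knows the public key can check each entry's signature,
// so a man-in-the-middle cannot inject fake data.
function verifyEntry(update: { entry: string; signature: Buffer }): boolean {
  return verify(null, Buffer.from(update.entry), publicKey, update.signature);
}

const update = signedAppend("coffee.csv v3: 2018-02-02,5 cups");
console.log(verifyEntry(update));                                  // true
console.log(verifyEntry({ ...update, entry: "tampered data" }));   // false
```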

\n

Tobias Macey 30:24

\n

And that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there’s only one source that’s allowed to update the information, then that solves a lot of the problems that might arise by syncing all these data sets between multiple machines, because there aren’t going to be multiple parties changing the data concurrently, so you don’t have to worry about how to handle those use cases. Another question that I had from what you were talking about is the cryptography aspect. It sounds as though when you initialize the dat, it just automatically generates the public and private key pair, and so that private key is cryptographically linked with that particular data set. But is there any way to use, for instance, Keybase or GPG to sign the source dat in addition to the generated key, to establish your identity for when you’re trying to share that information publicly, and not necessarily via some channel that already has established trust?

\n

Joe Hand 31:27

\n

Yeah, I mean, you could do that within the dat. We don’t really have any mechanism for doing that on top of Dat, so we’re sort of going to throw that into user land right now. But yeah, that’s a good question, and we’ve had some people, I think, experimenting with different identity systems and how to solve that problem. And I think we’re pretty excited about the new Wire app, because that’s open source and it uses end to end encryption and has some identity system, and we’re trying to see if we can build that on top of Wire. So that’s one of the things that we’re experimenting with.

\n

Tobias Macey 32:09

\n

And one of the primary use cases that is mentioned in the documentation and the website for Dat is being able to host and distribute open data sets, with the focus being on researchers and academic use cases. So I’m wondering if you can talk some more about how Dat helps with that particular effort, and what improvements it offers over some of the existing solutions that researchers were using prior.

\n

Danielle Robinson 32:33

\n

There are solutions for both hosting and distributing data. In terms of hosting and distribution, there’s a lot of great work focused on data publication and making sure that data associated with publications is available online; I’m thinking about Zenodo and Dryad or Dataverse. There are also other data hosting platforms such as CKAN or data.world. We really love the work these people do, and we’ve collaborated with some of them or are involved with them through friendly organizations; for example, the open source Alliance for Open Scholarship has some people from Dryad who are involved in it, so it’s nice to work with them. And we’d love to work with them to use Dat to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned, for example if there’s a large, live updating data set, there really aren’t great solutions to address data versioning and sharing. In terms of sharing and transferring, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software such as Globus or even Dropbox can require more IT infrastructure than a small research group may have. Researchers are all operating on limited grant funding, and they also depend on the IT infrastructure of their institution to get access to certain things. So a researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours or wait overnight to move it to another location. The ideal situation from a data management perspective is that the raw data are automatically archived to a web server and sent to the researcher’s computer for processing, so you have an archived copy of the raw data that came off of the equipment. In the process, files also need to be archived, so you need archives of the imaging files, in this case, at each step in processing. Then when a publication is ready, for the data processing pipeline to be fully reproducible, you’ll need the code and you’ll need the data at different stages, and even without access to the computer or the cluster where the analysis was done, a person should be able to repeat it. And I say ideally, because this isn’t really how it’s happening now.

\n

Some of the things that stop data from being archived at different steps are just the cost of storage, the availability of storage, and researcher habits. I definitely know some researchers who kept data on hard drives in Tupperware to protect them in case the sprinklers ever went off, which isn’t really a long term solution, true facts. So Dat can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I’m interested in anything that makes better data management automatic for researchers. We’re also interested in versioned compute environments to help labs avoid the drawer full of Jaz drives problem, which is, sadly, a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access. She has the drawer, she has the Jaz drives, she can’t get into them; that data is essentially lost. Researchers are really motivated to make sure that when things are archived, they’re archived in a form where they can actually be accessed, but because researchers are so busy, it’s really hard to know when that is. Because we’re so focused on essentially filling in the gaps between the services that researchers already use and that work well for them, and on automating things, I think Dat is in a really good position to solve some of these problems. Some of the researchers that we’re working with now, I’m thinking of one person who has a large data set and a bioinformatics pipeline, and he’s at a UC lab, and he wants to get all the information to his collaborator right here in Washington State. It’s taken months, and he has not been able to do it; he just can’t move that data across institutional lines. That’s a much longer conversation as to why exactly that isn’t working, but we’re working with him to try to make it possible for him to move the data and create a versioned emulation of his compute environment, so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and stuff. So yeah, hopefully that answers the question.

\n

Tobias Macey 37:39

\n

And one of the other difficult aspects of building a peer to peer protocol is the fact that, in order for there to be sufficient value in the protocol itself, there needs to be a network of people behind it to share that information with and to share the bandwidth requirements for distributing that information. So I’m wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort?

\n

Joe Hand 38:08

\n

Yeah, I’m not sure we really view Dat as that traditional peer to peer protocol, using that model of relying on network effects to scale. As Danielle said, we’re just trying to get data from A to B, and so our critical mass is basically two users on a given data set. Obviously, we want to first build something that offers better tools for those two users over a traditional cloud or client-server model. If I’m transferring files to another researcher using Dropbox, we have to transfer files via a third party and a third computer before they can get to the other computer. So rather than going direct between two computers, we have to go through a detour, and this has implications for speed, but also security, bandwidth usage, and even something like energy usage. So by cutting out that third computer, we feel like we’re already adding value to the network. We’re hoping that researchers who are doing this HTTP transfer can see the value of going directly, and of using something that is versioned and can be live synced, over existing tools like rsync or FTP or the commercial services that might store data in the cloud. And we really don’t have anything against the centralized services; we recognize that they’re very useful sometimes, but they also aren’t the answer to everything. Depending on the use case, a decentralized system might make more sense than a centralized one, and so we want to offer developers and users the option to make that choice, which we don’t really have right now. But in order to do that, we really have to start with peer to peer tools first. Once we have that decentralized network, we can basically limit the network to one server peer and many clients, and then all of a sudden it’s centralized. So we understand that it’s easy to go from decentralized to centralized, but it’s harder to go the other way around; we have to start with a peer to peer network in order to solve all these different problems. The other thing is that we know file systems are not going away. We know that web browsers will continue to support static files, and we also know that people will basically want to move these things between computers, back them up, archive them, and share them to different computers. So we know files are going to be transferred a lot in the future, and that’s something we can depend on. And people probably even want to do this in a secure way sometimes, and maybe in an offline environment or a local network. So we’re basically trying to build from those basic principles, using peer to peer transfer as the bedrock of all that. That’s how we got to where we are now with the peer to peer network. But we’re not really worried that we need a certain number or critical mass of users to add value, because we feel like by building the right tools with these principles, we can start adding value, whether it’s a decentralized network or a centralized network.

\n

Tobias Macey 40:59

\n

And one of the other use cases that’s been built on top of Dat is being able to build websites and applications that can be viewed by web browsers and distributed peer to peer in that manner. So I’m wondering how much uptake and usage you’ve seen for that particular application of the protocol, and how much development effort is being focused on that particular use case?

\n

Joe Hand 41:20

\n

Yeah, so if I open my Beaker browser right now, which is the main web implementation we have, that Paul Frazee and Tara Vancil are working on, I think I usually have 50 to 100, or sometimes 200, peers that I connect to right away. That’s through some of the social network apps, like Rotonde and Fritter, and then just some personal sites. We’ve been working with the Beaker browser folks probably for two years now, co-developing the protocol and seeing what they need support for in Beaker. But it comes back to that basic principle that we recognize that a lot of websites are static files, and if we can just support static files in the best way possible, then you can browse a lot of websites. And that even gives you a benefit for things that are more interactive; we know that they have to be developed so they work offline too. So both Rotonde and Fritter can work offline, and then once you get back online, you can just sync the data seamlessly. That’s the most exciting part about those.

\n

Danielle Robinson 42:29

\n

You mean Fritter, not freighter.

\n

Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker is a lot of fun, and if you’ve never played around with it, I would encourage you to download it; I think it’s just beakerbrowser.com. I’m not a developer by trade, but I have seriously enjoyed playing around on Beaker, and I think some of the more frivolous things like Fritter that have come out of it are a lot of fun, and really speak to the potential of peer to peer networks in today’s era, as people are becoming increasingly frustrated with the centralized platforms.

\n

Tobias Macey 43:13

\n

And given that the content being distributed via Dat using the browser is primarily static in nature, I’m wondering how that affects the sort of architectural patterns that people are used to with the common three tier architecture. You’ve already mentioned a couple of social network applications that have been built on top of it, but I’m wondering if there are any others that are built on top of and delivered via Dat that you’re aware of and could talk about, that speak to some of the ways that people are taking advantage of Dat in more of the consumer space?

\n

Joe Hand 43:47

\n

Yeah, I mean, I think one of the big shifts that has made this easier is having databases in the browser, things like IndexedDB or other local storage databases, and then being able to sync those to other computers. I think people are trying to build games off this. You could build a chess game where I write to my local database, and then you have some logic for determining if a move is valid or not, and then you sync that to your competitor. It’s a more constrained environment, but I think that also gives you the benefit of being able to constrain your development and not requiring these external services or external database calls or whatever. I know that I’ve tried a few times to develop projects, just fun little things, and it is a challenge, because you have to think differently about how those things work, and you can’t necessarily rely on external services, whether that’s something as simple as loading fonts from an external service, or CSS styles, or external JavaScript. You want that all to be packaged within one dat if you want to ensure it’s all going to work. So you have to think a little differently even on those simple things. But yeah, it does constrain the bigger applications. I think the other area where we could see development is more in Electron applications, so maybe not in Beaker, but in Electron, using that framework as a platform for other types of applications that might need those more flexible models. Science Fair, which is one of our hosted projects, is a really good example of how to use Dat in a way to distribute data but still have a full application. Basically, you can distribute all the data for the application over Dat and keep it updated through the live syncing, and users can download just the PDFs that they need to read, or the journals or the figures they want. That allows developers to have that flexible model where you can distribute things peer to peer and have both the live syncing and the ability to download whatever data users need, with Dat providing the framework for that data management.
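As a rough illustration of the local-first pattern Joe mentions, here is a tiny sketch of validating a move against local state and appending it to a local log that a sync layer (Dat, in their case) could later replicate to the other player. The game, its rules, and all of the names here are invented for the example; the sync layer itself is omitted.

```typescript
// Local-first sketch: validate and apply writes locally, keep an append-only
// log ready for later replication. No networking shown here.

type Move = { player: "white" | "black"; from: string; to: string };

class LocalGame {
  readonly log: Move[] = [];          // append-only local log, ready to replicate

  private nextPlayer(): "white" | "black" {
    return this.log.length % 2 === 0 ? "white" : "black";
  }

  // Local validation runs before anything is written or synced.
  tryMove(move: Move): boolean {
    if (move.player !== this.nextPlayer()) return false;  // out of turn
    if (move.from === move.to) return false;              // not a real move
    this.log.push(move);                                  // persisted locally first
    return true;
  }
}

const game = new LocalGame();
console.log(game.tryMove({ player: "white", from: "e2", to: "e4" })); // true
console.log(game.tryMove({ player: "white", from: "d2", to: "d4" })); // false, out of turn
// game.log would later be synced to the opponent whenever a connection is available.
```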

\n

Tobias Macey 46:15

\n

And one of the other challenges that’s posed, particularly for this public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable, since they’re essentially just hashes of the content. So I’m wondering if there are any particular mechanisms that you either have built or planned or started discussing for being able to facilitate content discovery of the information that’s being distributed by these different networks?

\n

Joe Hand 46:50

\n

Yeah, this is definitely an open question. I’ll fall back on my common answer, which is that it depends on the tool that we’re using and the different communities, and there are going to be different approaches; some might be more decentralized and some might be centralized. So, for example, with data set discovery, there are a lot of good centralized services for data set publishing, as Danielle mentioned, like Zenodo or Dataverse. These are places that already have discovery engines, I guess we’ll say, and they publish data sets. You could similarly publish the dat URL along with those data sets so that people have an alternative way to download them. That’s one way that we’ve been thinking about discovery: leveraging these existing solutions that are doing a really good job in their domain, and trying to work with them to start using Dat for their data management. Another sort of hacky solution, I guess I’ll say, is using existing domains and DNS. Basically, you can publish a regular HTTP site on your URL and give it a specific well-known file that points to your dat address, and then the Beaker browser can find that file and tell you that a peer to peer version of that site is available. So we’re basically leveraging the existing DNS infrastructure to start to discover content just with existing URLs. And I think a lot of the discovery will be more community based. So in, for example, Fritter and Rotonde, people are starting to build crawlers or search bots to discover users or search content, basically just looking at where there is need and identifying different types of crawlers to build and how to connect those communities in different ways. So we’re really excited to see what ideas pop up in that area, and they’ll probably come in a decentralized way, we hope.
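A rough sketch of the DNS-based discovery Joe describes: fetch a well-known file from an existing HTTPS domain and read the dat address out of it. The exact path and file format here follow the dat-dns convention as I understand it (a first line containing the dat URL, optionally followed by a TTL line), so treat those details as assumptions rather than a specification.

```typescript
// Check whether a domain advertises a peer-to-peer (dat) version of its site.
// Requires Node 18+ for the global fetch API.

async function resolveDatAddress(domain: string): Promise<string | null> {
  const res = await fetch(`https://${domain}/.well-known/dat`);
  if (!res.ok) return null;                       // no dat version advertised

  const firstLine = (await res.text()).split("\n")[0].trim();
  // Expect something like: dat://<64 hex characters>
  const match = firstLine.match(/^dat:\/\/([0-9a-f]{64})$/i);
  return match ? match[1] : null;
}

// Usage: a browser-like client could check any site it visits.
resolveDatAddress("example.com").then((key) => {
  if (key) {
    console.log(`Peer-to-peer version available at dat://${key}`);
  } else {
    console.log("No dat version advertised for this domain");
  }
});
```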

\n

Tobias Macey 48:46

\n

And for somebody who wants to start using Dat, what is involved in creating or consuming the content that’s available on the network? And are there any particular resources available to get somebody up to speed and help them understand how it works and some of the different uses that they could put it to?

\n

Danielle Robinson 49:05

\n

Sure, I can take that, and Joe, just chime in if you think of anything else. We built a tutorial for our work with the labs and for MozFest this year; that’s at try-dat.com. The tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug; there may be bugs, fair warning, but it was working pretty well when I used it last, and it’s in the browser. It spins up a little virtual machine, so you can either share data with yourself, or you can do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started; you can visit pages over dat just like you would a normal web page. For example, you can go to this website, and we’ll give Tobias the link to that, and just change the HTTP to dat, so it looks like dat://jhand.space. Beaker also has this fun thing that lets you create a new site with a single click, and you can also fork sites and edit them and make your own copies of things, which is fun if you’re learning how to build simple websites. You can go to beakerbrowser.com and learn about that. I think we’ve already talked about Rotonde and Fritter, and we’ll add links for people who want to learn more about those. And then for data focused users, you can use Dat for sharing or transferring files, either with the desktop application or the command line interface. So if you’re interested, we encourage you to play around; the community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask. We love talking to new people, because that’s how all the exciting stuff happens in this community.
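For a programmatic flavor of the sharing and cloning Danielle describes, here is a hedged sketch based on the dat-node package's callback-style API roughly as it was documented around the time of this episode; treat the exact module name, function signatures, and directory names as assumptions, and note the CLI equivalents were commands along the lines of `dat share` and `dat clone <link>`.

```typescript
// Sketch: share a folder over Dat and clone it by key, using dat-node.
// dat-node ships no TypeScript types, so it is imported loosely here.
const Dat: any = require("dat-node");

// Share: create (or load) a dat in ./coffee-data, import its files, join the network.
Dat("./coffee-data", (err: Error | null, dat: any) => {
  if (err) throw err;
  dat.importFiles();                                  // watch and add local files
  dat.joinNetwork();                                  // announce to peers
  console.log("Share this link: dat://" + dat.key.toString("hex"));
});

// Clone: given a link from a collaborator, download it into ./coffee-copy.
const link = process.argv[2];                         // e.g. the 64-character hex key
if (link) {
  Dat("./coffee-copy", { key: link }, (err: Error | null, dat: any) => {
    if (err) throw err;
    dat.joinNetwork();                                // connect and start syncing
  });
}
```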

\n

Tobias Macey 50:58

\n

And what have been some of the most challenging aspects of building the project and the community, and of promoting the use cases and capabilities of the project?

\n

Danielle Robinson 51:10

\n

I can speak a little bit to promoting it in academic research. In academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons. There’s tension between what your boss wants, what the IT department has approved, which means institutional data security needs, and then the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure, but it is a lot of promotion and outreach to get scientists to try a new workflow. They’re really busy, and the incentives are all: get more grants, do more projects, publish more papers. So even if something will eventually make your life easier, it’s hard to sink in time up front. One thing I noticed, and this is probably common to all industries, is that I’ll be talking to someone and they’ll say, oh, you know, archiving the data from my research group is not a problem for me, and then they’ll proceed to describe a super problematic data management workflow. It’s not a problem for them anymore because they’re used to it, so it doesn’t hurt day to day. But doing things like waiting until the point of publication and then trying to go back and archive all the raw data, when maybe some was collected by a postdoc who’s now gone and other data was collected by a summer student who used a non standard naming scheme for all the files, there are just a million ways that that stuff can go wrong. So for now, we’re focusing on developing real world use cases and participating in community education around data management. We want to build stuff that’s meaningful for researchers and others who work with data, and we think that by working with people and doing the nonprofit thing, grants are going to be the way to get us there. Joe, do you want to talk a little bit about building it?

\n

Joe Hand 53:03

\n

Yeah, sure. So in terms of building it, I haven’t done too much work on the core protocol, so I can’t say much around the difficult design decisions there. I’m the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems. So as Danielle said, it’s as much about people as it is about software in those decisions. But I think one of the most challenging things that we’ve run into a lot is basically network issues. In a peer to peer network, you have to figure out how to connect to peers directly in a network where they might not be supposed to do that. I think a lot of that is from BitTorrent sort of making different institutions restrict peer to peer networking in different ways, and so we’re having to fight that battle against these existing restrictions and trying to find out how these networks are restrictive and how we can continue to have success in connecting peers directly rather than through a third party server. And it’s funny, or maybe not funny, but some of the strictest networks we found are actually in academic institutions. For example, on one of the UC campuses, I think, we found out that computers can never connect directly to other computers on that same network. So if we wanted to transfer data between two computers sitting right next to each other, we basically have to go through an external cloud server just to get it to the computer sitting right next to it, or use something like a hard drive or a thumb drive or whatever. All these different network configurations, I think, are one of the hardest parts, both in terms of implementation but also in terms of testing, since we can’t readily get into these UC campuses to see what the network setup is. So we’re trying to create more tools around network testing, both testing networks in the wild and also using virtual networks to test different types of network setups, and leverage those two things combined to try and get around all these network connection issues. So yeah, I would love for you to ask Mathias this question around the design decisions in terms of the core protocol, but I can’t really say much about that, unfortunately.

\n

Tobias Macey 55:29

\n

And are there any particularly interesting or inspiring uses of Dat that you’re aware of that you’d like to share?

\n

Danielle Robinson 55:36

\n

Sure, I can share a couple of things that we were involved in. Last January, in 2017, we were involved in the Data Rescue and Libraries+ Network community, which was the movement to archive government-funded research at trusted public institutions like libraries and archives. As a part of that, we got to work with some of the really awesome people at the California Digital Library. The California Digital Library is really cool because it’s a digital library with a mandate to preserve, archive, and steward the data that’s produced in the UC system, and it supports the entire UC system. And the people are great. So we worked with them to make the first-ever backup of data.gov in January of 2017, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. That was a really cool project, and it produced a useful thing. We got to work with some of the data.gov people to make that happen, and they were like, wow, really, it has never been backed up? So it was a good time to do it. But believe it or not, it’s actually pretty hard to find funding for that work, and we have more work we’d like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long-term preservation of the research that gets done in this country, so hopefully 2018 will see those projects funded or new collaborations in that space. It’s also a fantastic community, because it’s a lot of really interesting librarians and archivists who have great perspective on long-term data preservation, and I love working with them. So hopefully we can do something else there. Then the other thing that I’m really excited about is the Dat in the Lab project, where we’re working on the container issue. I know we’re running a little over time, so I don’t know how much I should go into this, but we’ve learned a lot about really interesting research. We’re working to develop a container-based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly get the analysis pipelines that they’re working on usable in other locations. And this, believe it or not, is a big problem. I was sort of surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA. You could drive back and forth between UC Merced and UCLA a bunch of times in four months, but it’s this little stuff that really slows research down. So I’m really excited about the potential there. We’ve written a couple of blog posts on that, so I can add the links to those in the follow-up.
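As a rough sketch of the idea Danielle describes, where the cluster’s complete software environment lives in a container image so the same pipeline can run anywhere, here is one way it could look. The image name, paths, and pipeline script are hypothetical placeholders, not the actual Dat in the Lab tooling.

```python
# Illustrative only: run an analysis pipeline inside a container that holds
# the cluster's software environment. Image name, paths, and the pipeline
# script are hypothetical; this is not the Dat in the Lab implementation.
import subprocess

def run_pipeline_in_container(image, data_dir, command):
    """Run a pipeline command inside the container, mounting the data directory."""
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{data_dir}:/data",   # mount the project data into the container
         "-w", "/data",               # run from the mounted directory
         image] + command,
        check=True,
    )

# Example (hypothetical image and script):
# run_pipeline_in_container(
#     "registry.example.org/lab/cluster-env:2018-01",
#     "/scratch/my-project",
#     ["bash", "run_pipeline.sh"],
# )
```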

\n

Joe Hand 58:36

\n

And I’d say the most novel use that I’m excited about is called Hypervision. It’s basically video streaming built on Dat. Mathias Buus, one of the lead developers on Dat, is prototyping something similar with Danish public TV; they basically want to live stream their channels over the peer-to-peer network. So I’m excited about that, because I’d really love to get more public television and public radio distributing content peer to peer, so we can reduce their infrastructure costs and hopefully allow more of that great content to come out.

\n

Tobias Macey 59:09

\n

Are there any other topics that we didn’t discuss yet? What do you think we should talk about before we close out the show?

\n

Danielle Robinson 59:15

\n

Um, I think I’m feeling pretty good. What about you, Joe?

\n

Joe Hand 59:18

\n

Yeah, I think that’s it for me. Okay.

\n

Tobias Macey 59:20

\n

So for anybody who wants to keep up to date with the work you’re doing or get in touch, we’ll have you each add your preferred contact information to the show notes. And as a final question, to give the listeners something else to think about: from your perspective, what is the biggest gap in the tooling or technology that’s available for data management today?

\n

Joe Hand 59:42

\n

I’d say transferring files, which feels really funny to say, but to me it’s still a problem that’s not well solved. Just how do you get files from A to B in a consistent and easy-to-use manner, especially if you want a solution that doesn’t require a command line, is still secure, and hopefully doesn’t go through a third-party service? Because hopefully that also means it works offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that’s one of the biggest gaps that we don’t really address yet. There are a lot of great data management tools out there, but I think they’re aimed more at data scientists or software-focused users who might use managed databases or something like Hadoop. There’s really a ton of users out there who don’t really have tools. Most of the world is still offline or has inconsistent internet, and putting everything through servers in the cloud isn’t really feasible. But the alternatives now require careful, manual data management if you don’t want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.

\n

Danielle Robinson 01:00:48

\n

Plus one to what Joe said: transferring files. It does feel funny to say that, but it is still a problem in a lot of industries, and especially where I come from in research science. And from my perspective, I guess the other issue is that the people problems are always as hard as or harder than the technical problems. If people don’t think it’s important to share or archive data in an accessible and usable form, we could have the world’s best, easiest-to-use tool and it wouldn’t impact the landscape or the accessibility of data. Similarly, if people are sharing data that’s not usable, because it’s missing experimental context, or it’s in a proprietary format, or it’s shared under a restrictive license, it’s also not going to impact the landscape or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded and that data is shared in a usable form. That’s really key. I’ll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable, something your listeners might want to check out if they’re not familiar with it. It’s a framework developed in academia, but I’m not sure how much impact it’s had outside of that sphere, so it would be interesting to talk to your listeners a little bit about that. And yeah, I’ll put my contact info in the show notes, and I’d love to connect with anyone and answer any further questions about that and about what we’re going to try to do with Code for Science & Society over the next year. So thanks a lot, Tobias, for inviting us.

\n

Tobias Macey 01:02:30

\n

Yeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you’re doing. It’s definitely a very interesting project with a lot of potential, and I’m excited to see where you take it in the future. So thank you both for your time, and I hope you enjoy the rest of your evening.

\n

Unknown Speaker 01:02:48

\n

Thank you. Thank you.

\n

Transcribed by https://otter.ai?utm_source=rss&utm_medium=rss

\n

\n
\n\n

\"\"

","summary":"Dat Project: Efficient Sharing of Versioned Data Sets (Interview)","date_published":"2018-01-28T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/728bb7df-7927-4fcf-9480-76e0ca1323a9.mp3","mime_type":"audio/mpeg","size_in_bytes":42572426,"duration_in_seconds":3778}]},{"id":"podlove-2018-01-22t04:15:13+00:00-bd90f6b1c0f50f2","title":"Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15","url":"https://www.dataengineeringpodcast.com/snorkel-with-alex-ratner-episode-15","content_text":"Summary\n\nThe majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your definition of dark data and how Snorkel helps to extract value from it?\nWhat are some of the most challenging aspects of building labelling functions and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes?\nCan you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale?\nFor someone who wants to use Snorkel, what are the steps involved in processing the source data and what tooling or systems are necessary to analyse the outputs for generating usable insights?\nHow is Snorkel architected and how has the design evolved over its lifetime?\nWhat are some situations where Snorkel would be poorly suited for use?\nWhat are some of the most interesting applications of Snorkel that you are aware of?\nWhat are some of the other projects that you and your group are working on that interact with Snorkel?\nWhat are some of the features or improvements that you have planned for future releases of Snorkel?\n\n\nContact Info\n\n\nWebsite\najratner on Github\n@ajratner on Twitter\n\n\nParting Question\n\n\nFrom your 
perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nStanford\nDAWN\nHazyResearch\nSnorkel\nChristopher Ré\nDark Data\nDARPA\nMemex\nTraining Data\nFDA\nImageNet\nNational Library of Medicine\nEmpirical Studies of Conflict\nData Augmentation\nPyTorch\nTensorflow\nGenerative Model\nDiscriminative Model\nWeak Supervision\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.

\n\n

Preamble

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Snorkel: Extracting Value From Dark Data With Python (Interview)","date_published":"2018-01-21T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/338aa66c-3ab5-43fe-8d58-cbbcc5b37a89.mp3","mime_type":"audio/mpeg","size_in_bytes":23528277,"duration_in_seconds":2232}]},{"id":"podlove-2018-01-15t01:07:25+00:00-64fbecf7a8498a6","title":"CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14","url":"https://www.dataengineeringpodcast.com/crdts-with-christopher-meiklejohn-episode-14","content_text":"Summary\n\nAs we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYou have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?\nOther than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?\nOne of the projects that you have been involved in which relies on CRDTs is LASP. 
Can you describe what LASP is and what your role in the project has been?\nCan you provide examples of some production systems or available tools that are leveraging CRDTs?\nIf someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?\nWhat areas of research are you most excited about right now?\nGiven that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?\n\n\nContact Info\n\n\nWebsite\ncmeiklejohn on GitHub\nGoogle Scholar Citations\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nBasho\nRiak\nSyncfree\nLASP\nCRDT\nMesosphere\nCAP Theorem\nCassandra\nDynamoDB\nBayou System (Xerox PARC)\nMultivalue Register\nPaxos\nRAFT\nByzantine Fault Tolerance\nTwo Phase Commit\nSpanner\nReactiveX\nTensorflow\nErlang\nDocker\nKubernetes\nErleans\nOrleans\nAtom Editor\nAutomerge\nMartin Klepman\nAkka\nDelta CRDTs\nAntidote DB\nKops\nEventual Consistency\nCausal Consistency\nACID Transactions\nJoe Hellerstein\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.

\n\n

Preamble

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"CRDTs, Conflict Resolution, and Distributed Consensus in Real World Systems (Interview)","date_published":"2018-01-14T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/61c189d7-9415-4806-91a1-9e3e2593f553.mp3","mime_type":"audio/mpeg","size_in_bytes":29084349,"duration_in_seconds":2743}]},{"id":"podlove-2018-01-08t01:01:23+00:00-487089bc2c64a77","title":"Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13","url":"https://www.dataengineeringpodcast.com/citus-data-with-ozgun-erdogan-and-craig-kerstiens-episode-13","content_text":"Summary\n\nPostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostGreSQL\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Citus is and how the project got started?\nWhy did you start with Postgres vs. 
building something from the ground up?\nWhat was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version?\nHow well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?\nHow does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon?\nHow does Citus operate under the covers to enable clustering and replication across multiple hosts?\nWhat are the failure modes of Citus and how does it handle loss of nodes in the cluster?\nFor someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?\nHow do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?\nAre there any use cases that Citus enables which would be impractical to attempt in native Postgres?\nWhat have been some of the most challenging aspects of building the Citus extension?\nWhat are the situations where you would advise against using Citus?\nWhat are some of the most interesting or impressive uses of Citus that you have seen?\nWhat are some of the features that you have planned for future releases of Citus?\n\n\nContact Info\n\n\nCitus Data\n\ncitusdata.com\n@citusdata on Twitter\ncitusdata on GitHub\n\n\n\nCraig\n\n\nEmail\nWebsite\n@craigkerstiens on Twitter\n\n\n\nOzgun\n\n\nEmail\nozgune on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCitus Data\nPostGreSQL\nNoSQL\nTimescale SQL blog post\nPostGIS\nPostGreSQL Graph Database\nJSONB Data Type\nPipelineDB\nTimescale\nPostGres-XL\nAurora PostGres\nAmazon RDS\nStreaming Replication\nCitusMX\nCTE (Common Table Expression)\nHipMunk Citus Sharding Blog Post\nWal-e\nWal-g\nHeap Analytics\nHyperLogLog\nC-Store\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.

\n\n

Preamble

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n

\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Scaling PostGreSQL for Big Data and Parallel Execution with Citus Data (Interview)","date_published":"2018-01-07T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/740992b7-24f5-4101-af9d-95ef879a51cd.mp3","mime_type":"audio/mpeg","size_in_bytes":35123320,"duration_in_seconds":2804}]},{"id":"podlove-2017-12-25t03:47:39+00:00-b66e6101d7f468c","title":"Wallaroo with Sean T. Allen - Episode 12","url":"https://www.dataengineeringpodcast.com/wallaroo-with-sean-t-allen-episode-12","content_text":"Summary\n\nData oriented applications that need to operate on large, fast-moving sterams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Sean T. 
Allen about Wallaroo, a framework for building and operating stateful data applications at scale\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nWhat is Wallaroo and how did the project get started?\nWhat is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on?\nWhy did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented?\nHow is Wallaroo architected internally to allow for distributed state management?\n\nIs the state persistent, or is it only maintained long enough to complete the desired computation?\nIf so, what format do you use for long term storage of the data?\n\n\n\nWhat have been the most challenging aspects of building the Wallaroo platform?\nWhich axes of the CAP theorem have you optimized for?\nFor someone who wants to build an application on top of Wallaroo, what is involved in getting started?\nOnce you have a working application, what resources are necessary for deploying to production and what are the scaling factors?\n\n\nWhat are the failure modes that users of Wallaroo need to account for in their application or infrastructure?\n\n\n\nWhat are some situations or problem types for which Wallaroo would be the wrong choice?\nWhat are some of the most interesting or unexpected uses of Wallaroo that you have seen?\nWhat do you have planned for the future of Wallaroo?\n\n\nContact Info\n\n\nIRC\nMailing List\nWallaroo Labs Twitter\nEmail\nPersonal Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nWallaroo Labs\nStorm Applied\nApache Storm\nRisk Analysis\nPony Language\nErlang\nAkka\nTail Latency\nHigh Performance Computing\nPython\nApache Software Foundation\nBeyond Distributed Transactions: An Apostate’s View\nConsistent Hashing\nJepsen\nLineage Driven Fault Injection\nChaos Engineering\nQCon 2016 Talk\nCodemesh in London: How did I get here?\nCAP Theorem\nCRDT\nSync Free Project\nBasho\nWallaroo on GitHub\nDocker\nPuppet\nChef\nAnsible\nSaltStack\nKafka\nTCP\nDask\nData Engineering Episode About Dask\nBeowulf Cluster\nRedis\nFlink\nHaskell\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Fast and Scalable Real-Time Stream Computation with Wallaroo (Interview)","date_published":"2017-12-24T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/93644a6e-a334-4986-9855-bd725443f137.mp3","mime_type":"audio/mpeg","size_in_bytes":38436390,"duration_in_seconds":3553}]},{"id":"podlove-2017-12-18t01:05:08+00:00-5cc4317fde052d2","title":"SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11","url":"https://www.dataengineeringpodcast.com/siridb-with-jeroen-van-der-heijden-episode-11","content_text":"Summary\n\nTime series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database \n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nWhat is SiriDB and how did the project get started?\n\nWhat was the inspiration for the name?\n\n\n\nWhat was the landscape of time series databases at the time that you first began work on Siri?\nHow does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.?\nWhat do you view as the competition for Siri?\nHow is the server architected and how has the design evolved over the time that you have been working on it?\nCan you describe how the clustering mechanism functions?\n\n\nIs it possible to create pools with more than two servers?\n\n\n\nWhat are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem?\nIn the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon?\nOne of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. 
How are metrics identified in Siri and is there any support for tagging?\nWhat have been the most challenging aspects of building Siri?\nIn what situations or environments would you advise against using Siri?\n\n\nContact Info\n\n\njoente on Github\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nSiriDB\nOversight\nInfluxDB\nLevelDB\nOpenTSDB\nTimescale DB\nKairosDB\nWrite Ahead Log\nGrafana\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.

\n\n

Preamble

\n\n\n\n

Interview

\n\n

\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"SiriDB: Scalable Timeseries Database For Your System Metrics (Interview)","date_published":"2017-12-17T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/51ad4bd7-bcc1-4113-ab81-01066da679dc.mp3","mime_type":"audio/mpeg","size_in_bytes":21279629,"duration_in_seconds":2032}]},{"id":"podlove-2017-12-10t14:18:54+00:00-d3d8c2983023413","title":"Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10","url":"https://www.dataengineeringpodcast.com/confluent-schema-registry-with-ewen-cheslack-postava-episode-10","content_text":"Summary\n\nTo process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. 
Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nWhat is the schema registry and what was the motivating factor for building it?\nIf you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas?\nHow did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options?\nConversely, what would be involved in using a storage backend other than Kafka?\nWhat are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure?\nWhat are some of the biggest challenges that you faced while designing and building the schema registry?\nWhat is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations?\nWhat are some of the features or enhancements that you have in mind for future work?\n\n\nContact Info\n\n\newencp on GitHub\nWebsite\n@ewencp on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nKafka\nConfluent\nSchema Registry\nSecond Life\nEve Online\nYes, Virginia, You Really Do Need a Schema Registry\nJSON-Schema\nParquet\nAvro\nThrift\nProtocol Buffers\nZookeeper\nKafka Connect\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.

\n\n

Preamble

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"How Centralized Schemas Help Tame Distributed Streaming Analytics (Interview)","date_published":"2017-12-10T09:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ea534908-dfac-4dff-a31e-306c1ad7ba87.mp3","mime_type":"audio/mpeg","size_in_bytes":31598353,"duration_in_seconds":2961}]},{"id":"podlove-2017-12-03t03:11:07+00:00-fc2c042995799e7","title":"data.world with Bryon Jacob - Episode 9","url":"https://www.dataengineeringpodcast.com/data-dot-world-with-bryon-jacob-episode-9","content_text":"Summary\n\nWe have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nThis is your host Tobias Macey and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world\n\n\nInterview\n\n\nIntroduction\nHow did you first get involved in the area of data management?\nWhat is data.world and what is its mission and how does your status as a B Corporation tie into that?\nThe platform that you have built provides hosting for a large variety of data sizes and types. 
What does the technical infrastructure consist of and how has that architecture evolved from when you first launched?\nWhat are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased?\nWhat are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that intended for shared use?\nHow do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform?\nWhat are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform?\nWhat are the projects or companies that you consider to be your competitors?\nWhat are some of the most interesting or unexpected uses of the data.world platform that you are aware of?\n\n\nContact Information\n\n\n@bryonjacob on Twitter\nbryonjacob on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\ndata.world\nHomeAway\nSemantic Web\nKnowledge Engineering\nOntology\nOpen Data\nRDF\nCSVW\nSPARQL\nDBPedia\nTriplestore\nHeader Dictionary Triples\nApache Jena\nTabula\nTableau Connector\nExcel Connector\nData For Democracy\nJonathan Morgan\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.

\n\n

Preamble

\n\n\n\n

Interview

\n\n\n\n

Contact Information

\n\n\n\n

Parting Question

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"Data.World: The Platform For The Web Of Linked Data (Interview)","date_published":"2017-12-02T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/04c238c2-78a3-4d35-a9c3-a697fba495af.mp3","mime_type":"audio/mpeg","size_in_bytes":34176774,"duration_in_seconds":2784}]},{"id":"podlove-2017-11-22t11:22:25+00:00-d1acaf96838517e","title":"Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8","url":"https://www.dataengineeringpodcast.com/data-serialization-with-doug-cutting-and-julien-le-dem-episode-8","content_text":"Summary\nWith the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nThis is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.\n\nInterview\n\nIntroduction\nHow did you first get involved in the area of data management?\nWhat are the main serialization formats used for data storage and analysis?\nWhat are the tradeoffs that are offered by the different formats?\nHow have the different storage and analysis tools influenced the types of storage formats that are available?\nYou’ve each developed a new on-disk data format, Avro and Parquet respectively. 
What were your motivations for investing that time and effort?\nWhy is it important for data engineers to carefully consider the format in which they transfer their data between systems?\n\nWhat are the switching costs involved in moving from one format to another after you have started using it in a production system?\n\n\nWhat are some of the new or upcoming formats that you are each excited about?\nHow do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?\n\nContact Information\n\nDoug:\n\ncutting on GitHub\nBlog\n@cutting on Twitter\n\n\nJulien\n\nEmail\n@J_ on Twitter\nBlog\njulienledem on GitHub\n\n\n\nLinks\n\nApache Avro\nApache Parquet\nApache Arrow\nHadoop\nApache Pig\nXerox Parc\nExcite\nNutch\nVertica\nDremel White Paper\n\nTwitter Blog on Release of Parquet\n\n\nCSV\nXML\nHive\nImpala\nPresto\nSpark SQL\nBrotli\nZStandard\nApache Drill\nTrevni\nApache Calcite\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n","content_html":"

Summary

\n

With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.

\n

Preamble

\n\n

Interview

\n\n

Contact Information

\n\n

Links

\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

\n\n

\"\"

","summary":"How To Choose A Data Interchange Format (Interview)","date_published":"2017-11-22T09:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d14a1435-9dc9-418d-9bb6-9b66198f1450.mp3","mime_type":"audio/mpeg","size_in_bytes":36500480,"duration_in_seconds":3103}]},{"id":"podlove-2017-11-14t21:10:09+00:00-61d90351f65b4c9","title":"Buzzfeed Data Infrastructure with Walter Menendez - Episode 7","url":"https://www.dataengineeringpodcast.com/buzzfeed-data-infrastructure-with-walter-menendez-episode-7","content_text":"Summary\n\nBuzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. 
Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nHow is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for?\nWhat are some of the types of data inputs and outputs that you work with at Buzzfeed?\nIs the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision?\nWhat does the architecture of your data platform look like and what are some of the most significant areas of technical debt?\nWhich platforms and languages are most widely leveraged in your team and what are some of the outliers?\nWhat are some of the most significant challenges that you face, both technically and organizationally?\nWhat are some of the dead ends that you have run into or failed projects that you have tried?\nWhat has been the most successful project that you have completed and how do you measure that success?\n\n\nContact Info\n\n\n@hackwalter on Twitter\nwalterm on GitHub\n\n\nLinks\n\n\nData Literacy\nMIT Media Lab\nTumblr\nData Capital\nData Infrastructure\nGoogle Analytics\nDatadog\nPython\nNumpy\nSciPy\nNLTK\nGo Language\nNSQ\nTornado\nPySpark\nAWS EMR\nRedshift\nTracking Pixel\nGoogle Cloud\nDon’t try to be google\nStop Hiring DevOps Engineers and Start Growing Them\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

\n\n

Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.

\n\n

Preamble

\n\n\n\n

Interview

\n\n\n\n

Contact Info

\n\n\n\n

Links

\n\n\n\n

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\"\"

","summary":"","date_published":"2017-11-14T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/98f95c5e-b083-44b8-abfa-3a50913fb5fb.mp3","mime_type":"audio/mpeg","size_in_bytes":17832626,"duration_in_seconds":2620}]},{"id":"podlove-2017-08-06t09:26:59+00:00-1a858828690c96f","title":"Astronomer with Ry Walker - Episode 6","url":"https://www.dataengineeringpodcast.com/astronomer-with-ry-walker-episode-6","content_text":"Summary\n\nBuilding a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nThis is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.\n\n\nInterview\n\n\nIntroduction\nHow did you first get involved in the area of data management?\nWhat is Astronomer and how did it get started?\nRegulatory challenges of processing other people’s data\nWhat does your data pipelining architecture look like?\nWhat are the most challenging aspects of building a general purpose data management environment?\nWhat are some of the most significant sources of technical debt in your platform?\nCan you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them?\nThere are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?\nWhat are some of the most interesting or unexpected uses of your platform that you are aware of?\n\n\nContact Information\n\n\nEmail\n@rywalker on Twitter\n\n\nLinks\n\n\nAstronomer\nKiss Metrics\nSegment\nMarketing tools chart\nClickstream\nHIPAA\nFERPA\nPCI\nMesos\nMesos DC/OS\nAirflow\nSSIS\nMarathon\nPrometheus\nGrafana\nTerraform\nKafka\nSpark\nELK Stack\nReact\nGraphQL\nPostGreSQL\nMongoDB\nCeph\nDruid\nAries\nVault\nAdapter Pattern\nDocker\nKinesis\nAPI Gateway\nKong\nAWS Lambda\nFlink\nRedshift\nNOAA\nInformatica\nSnapLogic\nMeteor\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.

Preamble

Interview

Contact Information

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-08-06T05:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/011800ca-593b-4203-9d3e-8d3358637e5e.mp3","mime_type":"audio/mpeg","size_in_bytes":59881944,"duration_in_seconds":2570}]},{"id":"podlove-2017-06-17t01:30:26+00:00-59d95156935a241","title":"Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5","url":"https://www.dataengineeringpodcast.com/episode-5-rebuilding-yelps-data-pipeline-with-justin-cunningham","content_text":"Summary\n\nYelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline\n\n\nInterview with Justin Cunningham\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nCan you start by giving an overview of your pipeline and the type of workload that you are optimizing for?\nWhat are some of the dead ends that you experienced while designing and implementing your pipeline?\nAs you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?\nWhat are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?\nWhat are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?\nWhile you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?\nDid you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?\nWhat advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project? 
\n\n\nKeep in touch\n\n\nYelp Engineering Blog\nEmail\n\n\nLinks\n\n\nKafka\nRedshift\nETL\nBusiness Intelligence\nChange Data Capture\nLinkedIn Data Bus\nApache Storm\nApache Flink\nConfluent\nApache Avro\nGame Days\nChaos Monkey\nSimian Army\nPaaSta\nApache Mesos\nMarathon\nSignalFX\nSensu\nThrift\nProtocol Buffers\nJSON Schema\nDebezium\nKafka Connect\nApache Beam\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.

Preamble

Interview with Justin Cunningham

Keep in touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-06-17T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/049a093d-67d9-4c9d-a0f6-405a89bb8b4a.mp3","mime_type":"audio/mpeg","size_in_bytes":70067587,"duration_in_seconds":2547}]},{"id":"podlove-2017-03-18t11:11:04+00:00-e27681bd450533b","title":"ScyllaDB with Eyal Gutkind - Episode 4","url":"https://www.dataengineeringpodcast.com/episode-4-scylladb-with-eyal-gutkind","content_text":"Summary\n\nIf you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Eyal Gutkind about ScyllaDB\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is ScyllaDB and why would someone choose to use it?\nHow do you ensure sufficient reliability and accuracy of the database engine?\nThe large draw of Scylla is that it is a drop in replacement of Cassandra with faster performance and no requirement to manage th JVM. What are some of the technical and architectural design choices that have enabled you to do that?\nDeployment and tuning\nWhat challenges are inroduced as a result of needing to maintain API compatibility with a diferent product?\nDo you have visibility or advance knowledge of what new interfaces are being added to the Apache Cassandra project, or are you forced to play a game of keep up?\nAre there any issues with compatibility of plugins for CassandraDB running on Scylla?\nFor someone who wants to deploy and tune Scylla, what are the steps involved?\nIs it possible to join a Scylla cluster to an existing Cassandra cluster for live data migration and zero downtime swap?\nWhat prompted the decision to form a company around the database?\nWhat are some other uses of Seastar?\n\n\nKeep in touch\n\n\nEyal\n\nLinkedIn\n\n\n\nScyllaDB\n\n\nWebsite\n@ScyllaDB on Twitter\nGitHub\nMailing List\nSlack\n\n\n\n\n\nLinks\n\n\nSeastar Project\nDataStax\nXFS\nTitanDB\nOpenTSDB\nKairosDB\nCQL\nPedis\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

If you like the features of Cassandra DB but wish it ran faster with fewer resources, then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.

Preamble

Interview

Keep in touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-03-18T07:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/731d2aec-f184-42b9-849d-fdf9e8edf0eb.mp3","mime_type":"audio/mpeg","size_in_bytes":37751819,"duration_in_seconds":2106}]},{"id":"podlove-2017-03-05t02:45:25+00:00-648c6cb8078c5d9","title":"Defining Data Engineering with Maxime Beauchemin - Episode 3","url":"https://www.dataengineeringpodcast.com/episode-3-defining-data-engineering-with-maxime-beauchemin","content_text":"Summary\n\nWhat exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.\n\nTranscript provided by CastSource\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Maxime Beauchemin\n\n\nQuestions\n\n\nIntroduction\nHow did you get involved in the field of data engineering?\nHow do you define data engineering and how has that changed in recent years?\nDo you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?\nFor someone who wants to get started in the field of data engineering what are some of the necessary skills?\nWhat do you see as the biggest challenges facing data engineers currently?\nAt what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain?\nHow much analytical knowledge is necessary for a typical data engineer?\nWhat are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?\nYou have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice?\nHow has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?\nHow do you see the role of data engineers evolving in the next few years?\n\n\nKeep In Touch\n\n\n@mistercrunch on Twitter\nmistercrunch on GitHub\nMedium\n\n\nLinks\n\n\nDatadog\nAirflow\nThe Rise of the Data Engineer\nDruid.io\nLuigi\nApache Beam\nSamza\nHive\nData Modeling\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.

Transcript provided by CastSource

Preamble

Questions

Keep In Touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-03-04T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c9bdfeec-5f43-47c3-9535-811c2c8f953c.mp3","mime_type":"audio/mpeg","size_in_bytes":72883818,"duration_in_seconds":2720}]},{"id":"podlove-2017-01-22t16:54:47+00:00-69ad1641ced993b","title":"Dask with Matthew Rocklin - Episode 2","url":"https://www.dataengineeringpodcast.com/episode-2-dask-with-matthew-rocklin","content_text":"Summary\n\nThere is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.\n\n\nInterview with Matthew Rocklin\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nDask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated?\nThere are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that weren’t able to be solved by the existing options?\nOne of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long?\nDo you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future?\nAre you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons?\nThere is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill?\nFor anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like?\nHow do the task graph capabilities compare to something like Airflow or Luigi?\nLooking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution?\nWhat are some of the most interesting or unexpected projects that you have seen Dask used for?\nWhat do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support?\nWhat are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project?\nI know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. 
What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?\n\n\nKeep in touch\n\n\n@mrocklin on Twitter\nmrocklin on GitHub\n\n\nLinks\n\n\nhttp://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments?utm_source=rss&utm_medium=rss\nhttps://opendatascience.com/blog/dask-for-institutions/?utm_source=rss&utm_medium=rss\nContinuum Analytics\n2sigma\nX-Array\nTornado\n\nWebsite\nPodcast Interview\n\n\n\nAirflow\nLuigi\nMesos\nKubernetes\nSpark\nDryad\nYarn\nRead The Docs\nXData\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.

Preamble

Interview with Matthew Rocklin

Keep in touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-01-22T10:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0df71046-f709-43e1-a39f-8447e751024a.mp3","mime_type":"audio/mpeg","size_in_bytes":32558080,"duration_in_seconds":2760}]},{"id":"podlove-2017-01-14t18:36:08+00:00-fee8f4d3c9604ae","title":"Pachyderm with Daniel Whitenack - Episode 1","url":"https://www.dataengineeringpodcast.com/epsiode-1-pachyderm-with-daniel-whitenack","content_text":"Summary\n\nDo you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Daniel Whitenack about Pachyderm, a modern container based system for building and analyzing a versioned data lake.\n\n\nInterview with Daniel Whitenack\n\n\nIntroduction\nHow did you get started in the data engineering space?\nWhat is pachyderm and what problem were you trying to solve when the project was started?\nWhere does the name come from?\nWhat are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options?\nBecause of the fact that the analysis code and the data that it acts on are all versioned together it allows for tracking the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics?\nWhat does Pachyderm use for the distribution and scaling mechanism of the file system?\nGiven that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data?\nFor a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow?\nGiven that the state of the data is calculated by applying the diffs in sequence what impact does that have on processing speed and what are some of the ways of mitigating that?\nAnother compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?\n\nHow did you implement this feature so that it would be maintainable and easy to implement for end users?\n\n\n\nGiven that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. 
What are some things that users should be aware of to help mitigate this?\nThe data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow?\nI see that the docs mention using the map reduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill?\nWhat are some of the most interesting deployments and uses of Pachyderm that you have seen?\nWhat are some of the areas that you are looking for help from the community and are there any particular issues that the listeners can check out to get started with the project?\n\n\nKeep in touch\n\n\nDaniel\n\nTwitter – @dwhitena\n\n\n\nPachyderm\n\n\nWebsite\n\n\n\n\n\nFree Weekend Project\n\n\nGopherNotes\n\n\nLinks\n\n\nAirBnB\nRethinkDB\nFlocker\nInfinite Project\nGit LFS\nLuigi\nAirflow\nKafka\nKubernetes\nRkt\nSciKit Learn\nDocker\nMinikube\nGeneral Fusion\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"

Summary

Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!

Preamble

Interview with Daniel Whitenack

Keep in touch

Free Weekend Project

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-01-14T13:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/77759ae9-66c8-439b-90e8-1d9326279db6.mp3","mime_type":"audio/mpeg","size_in_bytes":42922090,"duration_in_seconds":2682}]},{"id":"podlove-2017-01-08t04:07:58+00:00-8f103a06ef5f7c5","title":"Introducing The Show","url":"https://www.dataengineeringpodcast.com/episode-0-introducing-the-show","content_text":"\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, share it on social media, and tell your friends and co-workers.\nI’m your host, Tobias Macey, and today I’m speaking with Maxime Beauchemin about what it means to be a data engineer.\n\nInterview\n\nWho am I\nSystems administrator and software engineer, now DevOps, focus on automation\nHost of Podcast.__init__\nHow did I get involved in data management\nWhy am I starting a podcast about Data Engineering\nInteresting area with a lot of activity\nNot currently any shows focused on data engineering\nWhat kinds of topics do I want to cover\nData stores\nPipelines\nTooling\nAutomation\nMonitoring\nTesting\nBest practices\nCommon challenges\nDefining the role/job hunting\nRelationship with data engineers/data analysts\nGet in touch and subscribe\nWebsite\nNewsletter\nTwitter\nEmail\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"
Preamble

Interview

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

","summary":"","date_published":"2017-01-07T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/bd6d3b16-30e7-47c2-b9e4-678672fbee81.mp3","mime_type":"audio/mpeg","size_in_bytes":7101268,"duration_in_seconds":263}]}]}