
Maintaining Your Data Lake At Scale With Spark - Episode 85

Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms, one that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.
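As a rough illustration of the kind of workflow discussed in the episode, the sketch below uses Delta Lake's PySpark API to land raw data in one table and progressively refine it into a cleaner one. The paths, column names, and cleanup rules here are hypothetical, and it assumes a Spark session with the delta-core package available.

```python
# Hypothetical paths and columns; assumes Spark is launched with the
# delta-core package on the classpath (e.g. via --packages).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("delta-refinement").getOrCreate()

# Land raw events as-is in a "bronze" Delta table. Each write is an
# ACID transaction, so a failed job never leaves partial files behind.
raw = spark.read.json("/data/landing/events/")
raw.write.format("delta").mode("append").save("/data/bronze/events")

# Progressively refine the raw records into a cleaner "silver" table:
# deduplicate, drop malformed rows, and derive a date column.
bronze = spark.read.format("delta").load("/data/bronze/events")
silver = (
    bronze
    .dropDuplicates(["event_id"])
    .filter(F.col("event_id").isNotNull())
    .withColumn("event_date", F.to_date("event_timestamp"))
)
silver.write.format("delta").mode("overwrite").save("/data/silver/events")
```

Because every write to a Delta table is a versioned, atomic transaction, a failed refinement job can simply be retried without leaving corrupt partial output in the lake.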


Managing The Machine Learning Lifecycle - Episode 84

Building a machine learning model can be difficult, but that is only half of the battle. Even a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software engineering and the challenges that it brings. He also describes the Hydrosphere platform, and how its different components work together to manage the full lifecycle of model deployment and retraining. This was a useful conversation for getting a better understanding of the unique difficulties that exist for machine learning projects.
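To make the training/serving gap concrete, here is a deliberately generic sketch (not Hydrosphere's API) of a trained model wrapped as a long-running prediction service; the model file, route, and payload shape are all hypothetical.

```python
# Generic illustration, NOT Hydrosphere's API. Assumes a scikit-learn
# style model saved with pickle; names and payload shape are hypothetical.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_VERSION = "v1"  # serving has to track which artifact is live

with open("model.pkl", "rb") as f:  # artifact produced by the training job
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = payload.get("features")
    if not isinstance(features, list):
        return jsonify(error="expected a 'features' list"), 400
    prediction = float(model.predict([features])[0])
    # A production platform would also log inputs and outputs here to
    # watch for data drift and decide when retraining is needed.
    return jsonify(prediction=prediction, model_version=MODEL_VERSION)

if __name__ == "__main__":
    app.run(port=8080)
```

None of the versioning, validation, or drift monitoring exists in the training notebook, which is why platforms in this space treat deployment and retraining as an ongoing lifecycle rather than a one-off release.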


Evolving An ETL Pipeline For Better Productivity - Episode 83

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components to a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.
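For a sense of the glue code such a migration replaces, here is a deliberately minimal, hypothetical example of an in-house extract-transform-load step (not Greenhouse's actual pipeline or DataCoral's API): the endpoint, schema, and database are placeholders.

```python
# Hypothetical in-house ETL glue code; endpoint, fields, and database
# are placeholders, not any real system described in the episode.
import sqlite3

import requests

def extract(api_url):
    """Pull raw order records from an internal service."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return response.json()

def transform(records):
    """Keep only completed orders and flatten the fields we report on."""
    return [
        (r["id"], r["customer"]["email"], r["total_cents"] / 100.0)
        for r in records
        if r.get("status") == "completed"
    ]

def load(rows, db_path):
    """Append the cleaned rows into a reporting table."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (id TEXT, email TEXT, total REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.close()

if __name__ == "__main__":
    rows = transform(extract("https://internal.example/api/orders"))
    load(rows, "warehouse.db")
```

Every schema change, retry policy, and scheduling concern lands on whoever owns scripts like this; the appeal of a managed service's abstractions is letting data scientists declare transformations without maintaining that plumbing.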
