Solving Data Lineage Tracking And Data Discovery At WeWork

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

16 December 2019

Solving Data Lineage Tracking And Data Discovery At WeWork - E111

Summary

Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. When it is properly integrated with the rest of your tools, the metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets. At WeWork, they needed a system that would provide visibility into their Airflow pipelines and the outputs they produce. In this episode, Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository, this is worth a listen to learn more about the value that visibility into your data can bring to your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web-based transformation tool with built-in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built-in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities, it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey, and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Marquez is?
- What was missing in existing metadata management platforms that necessitated the creation of Marquez?
How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
- How does it compare to the Amundsen platform that Lyft recently released?
What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
What are the primary resource types that you support in Marquez?
- What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
- Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
  - What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
How is the metadata itself stored and managed in Marquez?
- How much up-front data modeling is necessary and what types of schema representations are supported?
Can you talk through the overall workflow of someone using Marquez in their environment?
- What is involved in registering and updating datasets?
- How do you define and track the health of a given dataset?
- What are some of the interesting questions that can be answered from the information stored in Marquez?
What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
When is Marquez the wrong choice for a metadata repository?
What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem
- @J_ on Twitter
- Email
- julienledem on GitHub
Willy Lulciuc
- LinkedIn
- @wslulciuc on Twitter
- wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
