Open Source Production Grade Data Integration With Meltano

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Support the show!

13 July 2020

Open Source Production Grade Data Integration With Meltano - E141

0:00/0:00

Share on social media:

Summary

The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their Data and Meltano is building the reamining structure to make it a fully featured contender for proprietary systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Meltano is and the story behind it?
Who is the target audience?
- How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?
What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?
- What are the most painful transitions in that lifecycle and how does that pain manifest?
How and why has the focus of the project shifted from its original vision?
With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?
- What are the main elements of your strategy to address these barriers?
How is the Meltano platform in its current incarnation implemented?
- How much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?
What have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?
What are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?
When is Meltano the wrong choice?
What is your broad vision for the future of Meltano?
- What are the most immediate needs for contribution that will help you realize that vision?

Contact Info

Website
DouweM on GitLab
DouweM on GitHub
@DouweM on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat