Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allows his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service this is definitely worth listening to for some perspective.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
This episode of the Data Engineering Podcast is brought to you by Clubhouse, the first project management platform for software development that brings everyone together so that teams can focus on what matters – creating products their customers love. Clubhouse provides the perfect balance of simplicity and structure for better cross-functional collaboration. Its fast, intuitive interface makes it easy for people on any team to focus-in on their work on a specific task or project, while also being able to “zoom out” to see how that work is contributing towards the bigger picture. With a simple API and robust set of integrations, Clubhouse also seamlessly integrates with the tools you use everyday, getting out of your way so that you can deliver quality software on time.
Listeners of the Data Engineering Podcast can sign up for two free months of Clubhouse by visiting dataengineeringpodcast.com/clubhouse.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral
- How did you get involved in the area of data management?
- Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
- Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
- What are your primary sources of data and what are the targets that you are loading them into?
- What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
- What were your criteria for your replacement technology and how did you gather and evaluate your options?
- Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
- What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
- What were the big wins?
- What was your evaluation framework for determining whether your re-engineering was successful?
- Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
- If you have freed up time for your engineers, how are you allocating that spare capacity?
- What do you hope to see from DataCoral in the future?
- What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Periscope Data
- Mode Analytics
- Data Warehouse