Data Infrastructure Automation For Private SaaS At Snowplow - Episode 120

Summary

One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers’ cloud accounts.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
  • What are some of the challenges that are inherent to the private SaaS nature of your managed service?
  • What elements of your system require the most attention and maintenance to keep them running properly?
  • Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?
  • How do you manage deployment of the full Snowplow pipeline for your customers?
    • How has your strategy for deployment evolved since you first began offering the managed service?
    • How has the architecture of the pipeline evolved to simplify operations?
  • How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
    • What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
      • How does that reflect in the tooling that you use to manage their deployments?
  • What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers’ pipelines are running smoothly?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
  • What are some lessons that you can generalize for management of data infrastructure more broadly?
  • If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
  • What do you have planned for the future of the Snowplow product and infrastructure management?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost you might expect. You deserve ClickHouse, the open source analytical database that deploys and scales wherever and whenever you want it to, and turns data into actionable insights. And Altinity, the leading software and services provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity, that's A-L-T-I-N-I-T-Y, for a free consultation to find out how they can help you today. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in New York, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers' cloud accounts. So Josh, can you start by introducing yourself?
Josh Beemster
0:02:33
Sure, pleasure to be here. My name is Josh and I'm the technical operations lead at Snowplow. I've been heading up that role for the last four years, and working with Snowplow for the last five years. I'm responsible for all the cloud infrastructure and maintenance across our currently 150-plus clients.
Tobias Macey
0:02:53
And do you remember how you first got involved in the area of data management?
Josh Beemster
0:02:58
It was a little bit by accident, honestly. I started at Snowplow as an engineer, and expressed quite an interest in automation and getting rid of repetitive tasks. That naturally moved into infrastructure management, and it just went from there.
Tobias Macey
0:03:13
And so can you start by giving a bit of an overview of the components in the system architecture of Snowplow, and the nature of how you deploy and maintain the managed service that you offer?
Josh Beemster
0:03:26
Sure. I think it's probably best to start with the nature of the managed service before jumping into the system architecture. What we offer is essentially what we've coined as private SaaS. What we mean by that is that it's a fully managed service, but it's isolated in a client's own sub-account. Essentially, each client comes to us and gives us their own sub-account or their own Google Cloud project, and we set up and maintain a full data pipeline stack within that sub-account, so every client has their own isolated infrastructure, entirely segmented from every other client. It's SaaS in that we manage everything end to end and we're responsible for all of the running of it, but it's very much not SaaS in that there's no shared tenancy across anything. That's obviously quite a difficult thing to manage. In terms of what that means for the system architecture on our side, we have a lot of tooling that we've built to manage that, which leverages the HashiCorp stack heavily. We're using a lot of Terraform to define all of our infrastructure as code, we've got HashiCorp Consul to manage all of our metadata and Vault to manage secrets, and then Nomad to do the widespread distribution of deploying all of that infrastructure. So I guess it's different in that sense. There are some parallels with companies that would offer you a license to their software that you would go away and deploy yourself. We take that a step further, where you're not only buying the license, you're buying the whole stack experience.
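To make the Consul-for-metadata and Vault-for-secrets split concrete, here is a minimal sketch of how a deployment tool might read per-client configuration from Consul's KV store and credentials from Vault before rendering anything. The endpoints, key paths, and client libraries (python-consul, hvac) are illustrative assumptions, not Snowplow's actual tooling.

```python
# Hedged sketch: pull per-client pipeline metadata from Consul and credentials
# from Vault. Key paths and host names are hypothetical examples.
import json

import consul  # pip install python-consul
import hvac    # pip install hvac

CONSUL_HOST = "consul.internal.example.com"        # assumed internal endpoint
VAULT_ADDR = "https://vault.internal.example.com"  # assumed internal endpoint


def load_client_config(client_id: str) -> dict:
    """Read the pipeline metadata stored for one client in Consul's KV store."""
    c = consul.Consul(host=CONSUL_HOST)
    _index, entry = c.kv.get(f"pipelines/{client_id}/config")  # hypothetical key
    if entry is None:
        raise KeyError(f"no config found for {client_id}")
    return json.loads(entry["Value"].decode("utf-8"))


def load_client_secrets(client_id: str, vault_token: str) -> dict:
    """Read that client's cloud credentials from Vault's KV v2 secrets engine."""
    v = hvac.Client(url=VAULT_ADDR, token=vault_token)
    resp = v.secrets.kv.v2.read_secret_version(path=f"pipelines/{client_id}")
    return resp["data"]["data"]


if __name__ == "__main__":
    config = load_client_config("acme-prod")
    secrets = load_client_secrets("acme-prod", vault_token="s.example")
    print(config["cloud"], list(secrets))
```

The appeal of the pattern is that the deployment code itself stays generic: everything client-specific lives in centralized metadata and secret stores rather than in the templates.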
Tobias Macey
0:05:05
And for people who want to dig deeper into what Snowplow itself is, I'll add a link to the interview that I did with your co-founder, Alex. But at a high level, it's an event data management platform for being able to replace things like Google Analytics or Segment. And so in terms of the private SaaS nature of your product, what are some of the challenges that are inherent to that deployment model that you're trying to overcome with some of the tooling and automation that you've built out?
Josh Beemster
0:05:32
The obvious one is that rather than having one big data pipeline to manage, we have 150 of these, and rather than being in one or two or three regions across the world, we're in pretty much every region of the world. So there are difficulties first off in just the sheer number of servers that we are responsible for, numbering in the tens of thousands, with quite a small SRE team. That is a big challenge in and of itself. With threats like, at the start of last year, Meltdown and Spectre and all of these scary security concerns, we were suddenly going, well, we're responsible for these servers, we have to go and manage that. So that's a big challenge for us in terms of staying on top of all of these systems and making sure that they're always up to date.

The other side of it is just how much automation we need to build into it. There can't be anything that really requires manual interaction or manual intervention. From top to bottom, everything must be self-healing, everything must be able to automatically recover, because otherwise we just can't scale our operation at all. Specifically on managing clients in this context, there's obviously a sharp change in dynamic, where in normal SaaS you wouldn't get to see the underlying infrastructure, you wouldn't get to see how things have been set up, and you wouldn't be able to go and poke around at any of these things. What we've had with quite a few clients is that they delve a little bit too deep, or they go and change things that maybe they shouldn't have, or they go and break things that maybe they shouldn't have. Those are quite difficult for us to manage, because obviously we've gone and deployed something in a way that we expect it to work, and then someone else has come in and turned something off or broken something or changed our access, whatever the case might be. Trying to balance that is quite difficult, so we've got a lot of drift-detection style systems constantly checking that nothing has changed and everything is staying exactly as it should be. So there are a lot of those challenges with, I guess, what you'd call the shared responsibility model between us and the client in terms of managing that sub-account, which we do struggle with sometimes.

The other interesting thing with private SaaS is just how auditable and how exposed it is. As we work with more security-conscious clients, we do end up getting audited a lot. We have a lot of very low-level conversations about how things have been set up and how things need to change to suit their particular business needs. Where normally you'd go and buy a service and not be too worried about exactly how they've instrumented it, suddenly, when we're deploying inside the client's ecosystem, we have to fit their checklist, we have to fit all of their security concerns, and we have to pass muster with their security teams and their SRE teams to make sure that everything is exactly how it needs to be for them. So we've got a lot of challenges there, not only in managing and orchestrating and running the whole thing, but also just getting signed off by a lot of these teams: is this up to spec?
So we have a lot of extras added into the platform as we have these conversations, which we have to adjust and manage in such a way that it is still scalable and still going to work for everyone, but we have to add lots of extra things on the fly.
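Josh mentions drift-detection style systems that constantly check whether anything has been changed out from under them in a client's sub-account. As a rough illustration of the idea (not Snowplow's actual tooling), one common way to detect drift in Terraform-managed infrastructure is to run `terraform plan -detailed-exitcode` on a schedule and alert whenever a stack reports pending changes; the stack paths and the notification hook below are assumptions.

```python
# Hedged sketch: detect drift by running `terraform plan` with -detailed-exitcode
# (0 = no changes, 1 = error, 2 = pending changes / drift) for each stack.
# Directory layout and notify() are hypothetical.
import subprocess


def check_drift(stack_dir: str) -> str:
    """Return 'clean', 'drift', or 'error' for one Terraform stack directory."""
    result = subprocess.run(
        ["terraform", f"-chdir={stack_dir}", "plan",
         "-detailed-exitcode", "-input=false", "-no-color"],
        capture_output=True, text=True,
    )
    return {0: "clean", 2: "drift"}.get(result.returncode, "error")


def notify(message: str) -> None:
    # Placeholder: in practice this might post to Slack, PagerDuty, or SNS.
    print(message)


if __name__ == "__main__":
    for stack in ["stacks/acme-prod/collector", "stacks/acme-prod/enrich"]:
        status = check_drift(stack)
        if status != "clean":
            notify(f"{stack}: {status} detected; manual changes may have been made")
```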
Tobias Macey
0:08:54
And because of the fact that you are running in the customer's account, I'm sure that there's also some measure of cost consciousness in terms of the bill for running all of these different resources, and handling scaling and trying to minimize the amount of resources that are necessary to keep this running. In a SaaS, the provider eats all of those costs and just passes them on in the charge to the end user, but because the end user in this case is running all of this in their own infrastructure, they're much more cognizant of the actual overall cost of running all of these pieces of infrastructure. So I'm wondering how you handle minimizing the resources necessary while still allowing for robustness and scalability in the platform that you're deploying.
Josh Beemster
0:09:41
That's a really pertinent question, and a great one to raise. It's a bit of a balancing act. We do have hard rules that we don't want to breach when it comes to deploying a production environment, which we do get sign-off on from the client. For example, a production environment has to be highly available, so you have to have a minimum of two availability zones, and you need to be setting up enough servers that if a catastrophic data center failure happens, you can handle it. So there are carve-outs to say there's a minimum spec for what this looks like. But beyond that, the architecture is flexible enough that we can save costs in a lot of ways. We work quite closely with clients on how we do instance reservations, how we right-size pipelines, and how we right-size them for their particular traffic patterns. So there is a level of customization and work that we have to go through to make that happen.

On the whole, though, we've come to a pretty good sense of what the minimum things are that we need, and that's where we start. As clients ramp up traffic, we then tend to have the harder discussions around, okay, to make sure that this is stable at these volumes we're going to have to massively upscale Kinesis, for example, to make sure that there are no back-pressure issues, or you've got a one-second latency requirement, which is going to require these sorts of changes. Every client is different in that sense. Some of them won't mind there being a bit of latency build-up in their pipeline, and then we can size it down for cost; some are very latency-conscious but far less cost-conscious. So it also comes down to what the value of the data is for the company. If it's just a report that runs once a week, we can definitely optimize for cost. If it's something that needs to run every second, then we need to optimize for performance. It's a very conversation-heavy topic with clients, we find.

There are obviously blanket strategies: instance reservations are a classic one, and turning off certain parts of the service as well. It being quite a modular data pipeline, you can plug in different exports. For example, in the real-time pipeline that we deploy on AWS, you can send data into Elasticsearch, you can send data into S3, you can send data into Snowflake DB, you can send data into Indicative, you can send data into Redshift, but you don't have to pick all of these targets. So we also work with clients to figure out what the best way is for them to consume this information, and then set up their pipeline accordingly so they're only paying for what they really need. But you're exactly right that because it is running in their sub-account, they do bear that cost. The flip side to that particular point, though, is an interesting one, which is that none of our clients really have to worry about volume-based pricing in the same way as you would with a competitor like Segment, for example, or any of the other SaaS analytics providers with volume-based pricing. With Snowplow, as you scale up and as you track more, your costs are not going to drastically increase; they will increase linearly with infrastructure costs.
But there isn't the kind of exponential cost growth that generally comes with volume-based pricing.
Tobias Macey
0:13:11
And then another thing that's interesting about your model is that shared responsibility you mentioned, because the servers are running in the client's account and they have their own way of managing infrastructure. I'm sure that there are some instances where you have conflicts as far as how they would prefer to handle deployments, where they have their own infrastructure automation and configuration management. And then on the monitoring side, I know that you keep track of the health and well-being of the overall system, but I'm sure that the customer is also interested in being able to consume those metrics into their own systems to get visibility. So I'm wondering how you handle that aspect of the responsibility being on your end to keep everything running, but at the same time the customer wanting to have greater visibility and tight integration with the systems that they already have deployed.
Josh Beemster
0:14:01
Yes, that is a really common theme. On the monitoring side, we do tend to leverage Amazon CloudWatch and, on GCP, Stackdriver quite heavily to alleviate that to some extent. Rather than figuring out how to pull these metrics in and present them back to the client in a nice portable way, we leverage the cloud's own tooling there and then provide easier ways for them to hook into, for example, SNS topics for getting all of the same alerts that our own ops teams get. So they can look at that, but also, by virtue of the fact that it's in their sub-account, they can explore any metric that's exported to those systems, and that's where all of our metrics live. There's very little bespoke monitoring, so to speak, that comes only to us and that only we have visibility over. So on the monitoring side that's mostly handled now, which does make life easier.

On the customization side, where we're not fitting the perfect mold, it is often fairly difficult. There are times that we do need to develop more custom solutions, which we'll do on an ad hoc basis. The difficulty for us in offering any sort of bespoke solution, though, is that we are running it across 150 different stacks, so the name of the game for us has to be consistency across the client base. What ends up happening is that as clients have more security requirements, those become part of the standards that we make available for everyone. Any feature that we develop for one is developed for all, and in that way, as we go, we end up being able to tick more of those security boxes without necessarily having to do something bespoke every single time. In a lot of cases we do manage to convince them that making too many changes is not always necessary, and that we can have that shared responsibility where we run things how we need to run them without needing to change everything. We're yet to come up against someone that really won't let us work how we want to work.
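As a rough sketch of the pattern Josh describes, where alerts flow through the cloud's native tooling so that both the operator's ops team and the customer can subscribe to the same SNS topic, the snippet below creates a CloudWatch alarm on a Kinesis consumer-lag metric and points it at an SNS topic. The stream name, topic ARN, and thresholds are illustrative assumptions, not Snowplow's actual configuration.

```python
# Hedged sketch: alarm on growing Kinesis iterator age (a sign the pipeline is
# falling behind) and notify an SNS topic that both the operator and the
# customer can subscribe to. Names, ARNs, and thresholds are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="acme-prod-enrich-iterator-age",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "acme-prod-enriched-stream"}],
    Statistic="Maximum",
    Period=60,                 # evaluate one-minute windows
    EvaluationPeriods=5,       # five consecutive breaches before alerting
    Threshold=600_000,         # ten minutes of lag, in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:acme-prod-pipeline-alerts"],
    AlarmDescription="Enriched stream consumers are falling behind",
)
```

Because the alarm and topic live inside the customer's own sub-account, the customer can attach their own email, chat, or PagerDuty subscriptions to the same topic without any bespoke integration work.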
Tobias Macey
0:16:13
And another element of customization is the fact that the overall Snowplow pipeline is very composable, and different components within the stack can be swapped for some equivalent system. For instance, the Kinesis that you would run in an AWS account might be replaced with Google Cloud Pub/Sub in a Google account, or somebody might already be running Kafka or Pulsar. So I'm wondering how you approach that aspect of customization and allowing the customers to specify how they want different elements of the system to be manifested, based on their preferences or what they already have running, to allow for better integration with the data systems that they might want to integrate with.
Josh Beemster
0:16:53
With the managed service that we offer at the moment, on GCP we only support Pub/Sub and on Amazon we only support Kinesis. So for that core part of the pipeline, where we're collecting and enriching and storing that data, there's not a whole lot of flexibility just yet, in terms of the fact that we have to own a standard part of the pipeline that is the same for everyone. Where we allow a lot of flexibility, though, is in what you can plug into the pipeline on top of that. For example, if you've got a big internal Kafka cluster, what we see a lot of clients end up doing is streaming all the data that we place into Kinesis into their own Kafka cluster to do larger fan-out operations, which Kinesis doesn't support as well as Kafka might.

The key issue there that's worth touching on is not that we don't want to support loading into someone else's data stream, it's that we support SLAs on the latency of data into those data streams, and if we have a third-party dependency that we can't control, that's very difficult to meet. For example, if a client did want us to load directly into their Kafka cluster, but we had no authority over said Kafka cluster and they had a massive spike in traffic, there's no way for us to really account for that. There's no way for us to go and say, hey, we need to increase the size of your Kafka cluster because the pipeline can't keep up. So there is a certain need for a separation of concerns there as well, so that we can ensure the health of the pipeline. Having too many external dependencies, or any external dependencies, makes that exponentially more difficult, not only in terms of making sure that the pipeline is working, but even just debugging why something is happening becomes harder, because you're not in control of the entire fabric, the entire system. So the core part is very locked down in terms of what we support, and we very much want to be in control of that. But as I said, forking off the pipeline into other systems is definitely something we see a lot of. Writing custom Lambda functions or Google Cloud Functions, or streaming that data with Kinesis-to-Kafka connectors, are quite common patterns to push that data into new and different places. And we obviously also have our own Analytics SDKs that can plug in on top of those streams and then do custom mutations before sending the data somewhere else as well.
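To illustrate the fan-out pattern Josh describes, here is a minimal sketch of a Lambda function that reads enriched events from a Kinesis stream and republishes them to an internal Kafka cluster. The topic name, broker addresses, and the use of the kafka-python client are illustrative assumptions, not a Snowplow-provided connector.

```python
# Hedged sketch: Kinesis-triggered Lambda that forwards enriched events into an
# internal Kafka cluster for downstream fan-out. Brokers, topic, and keying
# strategy are hypothetical.
import base64

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-1.internal.example.com:9092"],  # assumed brokers
    acks="all",
)


def handler(event, context):
    """Lambda entry point for a Kinesis event source mapping."""
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        # Reuse the Kinesis partition key so per-key ordering is preserved.
        key = record["kinesis"]["partitionKey"].encode("utf-8")
        producer.send("snowplow-enriched-events", key=key, value=payload)
    producer.flush()
    return {"records_forwarded": len(event["Records"])}
```

Because the fork happens downstream of the managed stream, the core pipeline's latency SLA stays independent of the health of the customer's Kafka cluster, which is exactly the separation of concerns described above.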
Tobias Macey
0:19:31
And so in the overall system, which components are the ones that are most subject to variability in traffic or resource pressure, and what are some of the strategies that you use to ensure proper capacity when there's burstiness in the events being ingested, or to be able to meet some of those latency SLAs that you mentioned?
Josh Beemster
0:19:53
So obviously we leverage a lot of auto-scaling to account for that burstiness, but not everything is auto-scaled. The biggest issues that we come up against in dealing with that burstiness are sometimes to do with how fast we can get new EC2 nodes online, but generally it's with the few non-auto-scaling components within the pipeline. Take GCP, for example: we're using Pub/Sub there, and Pub/Sub is this beautifully elastic, auto-scaling system where you can throw as much as you want at it and it will scale to meet demand without any issues. Where we run into issues is on the flip side, in how we run AWS, which is using Kinesis, which has a fixed size. Kafka or Azure Event Hubs would have the same sorts of issues, where you've got much more fixed, sharding-based ingress limitations.

There are two ways we tend to tackle that. One is that we've built our own sort of proprietary auto-scaling tech that goes and does reshards of Kinesis as and when needed. But that tends to fall over quite quickly at higher shard counts: if you're getting into the hundred or two hundred plus shards, doing a resize can take anywhere up to 30 or 35 minutes, which is often far too slow for a very big burst in traffic. So in these cases we tend to work with clients, look at their traffic patterns, and look at where they're going to be evolving up to. We do quite a bit of trend analysis there, and we can say, well, if you want to keep the pipeline healthy, we're going to have to keep this much of a buffer in place for this non-auto-scaling component, otherwise we're going to run into issues that are not going to be recoverable very quickly. This is obviously not the best strategy: instead of having a nice auto-scaling, elastic architecture, you've suddenly got hard-coded capacity, which means that we have to have around-the-clock ops availability to check for alerts, check when you're starting to reach those thresholds, and go and scale that up. We're actively looking at alternatives at the moment for how we can swap out those systems for something a bit more Pub/Sub-esque, especially on Amazon. We're looking at, for example, how we could maybe swap out Kinesis for SQS and SNS for that similar sort of elastic, auto-scaling queuing with fan-out, rather than leveraging something like Kinesis. On the streaming side, that's probably the biggest bottleneck; the rest of it is quite easy to auto-scale and is generally quite fast.

The other area where we have issues is with downstream data stores that are by nature a lot more static in size. Snowflake DB in a lot of ways has solved that, and BigQuery obviously has solved it as well, where there's effectively unlimited storage capacity: you can just throw whatever you want into it, and it's backed by blob storage, so you have your data lake in that sense. Where we do run into some issues, which Redshift is starting to address with the new instance types they've released, is that Redshift and Elasticsearch still serve as a weak point in the architecture, because there is capped capacity.
And especially when you're looking at a streaming pipeline, and you want to stream data in as quickly as it's arriving, big spikes in traffic can overwhelm CPU resources and can suddenly overwhelm the amount of provisioned capacity that you have for these systems, which can cause service interruption and downtime. So we have, again, around-the-clock teams that are waiting for these alerts to go in and upscale systems as and when we breach those thresholds. I guess the strategies there are to look at patterns: how has my tracking been evolving over the last months, how has the pipeline handled spikes in the past, and then size it up with a healthy buffer to make sure that when these things happen again, you're covered. But there's a limit to what we can do, especially running so many of these systems, to try to catch all edge cases, which is why we still need that around-the-clock ops team to deal with it.
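Josh mentions proprietary tooling that reshards Kinesis streams as traffic grows. Here is a minimal sketch of that kind of resize, assuming boto3 and a simple rule of thumb for sizing; the stream name, headroom factor, and throughput math are illustrative assumptions, not Snowplow's actual scaling logic.

```python
# Hedged sketch: resize a Kinesis stream based on recent peak ingest volume.
# Each shard accepts roughly 1 MB/s (or 1,000 records/s) of writes; the 50%
# headroom and stream name are hypothetical choices.
import math
from datetime import datetime, timedelta

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def peak_incoming_mb_per_sec(stream: str, hours: int = 24) -> float:
    """Peak one-minute ingest rate for the stream over the last `hours` hours."""
    now = datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    peak_bytes_per_min = max((p["Sum"] for p in stats["Datapoints"]), default=0.0)
    return peak_bytes_per_min / 60 / 1_000_000


def resize_if_needed(stream: str, headroom: float = 1.5) -> None:
    target = max(1, math.ceil(peak_incoming_mb_per_sec(stream) * headroom))
    current = kinesis.describe_stream_summary(StreamName=stream)[
        "StreamDescriptionSummary"]["OpenShardCount"]
    if target != current:
        # Note: UpdateShardCount only allows moving to between half and double
        # the current shard count per call, and large resizes are slow.
        kinesis.update_shard_count(
            StreamName=stream,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )


resize_if_needed("acme-prod-raw-stream")
```

The slowness Josh describes shows up in exactly this API: a resize at a few hundred shards can take tens of minutes, which is why a pre-provisioned buffer is still needed for sudden bursts.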
Tobias Macey
0:24:18
And then another interesting point for me is the fact that you ended up going with Nomad as your substrate for being able to handle bin packing and managing the processes for all the different components that you're running, where a lot of the mindshare right now is with Kubernetes. And so I'm wondering if you can talk through the overall decision-making process that led you to that conclusion, and maybe talk a bit about some of the ways that your infrastructure management has evolved since you first began tackling this problem.
Josh Beemster
0:24:48
Just a quick point of clarification: for the client-side pipelines, we are actually leveraging Kubernetes in the GCP pipeline, and we're looking to leverage ECS in the Amazon pipeline, so we only use Nomad internally, as our internal orchestration and scheduling fabric. The reason we've chosen Nomad for that task is really just its deep level of integration with the rest of the HashiCorp stack. It seemed natural to say, well, we're using Terraform, we're using Consul, we're using Vault, we're using Packer, so we should use Nomad as well, because it has all of those nice, neat native integrations. I can talk about why we've chosen ECS as well, possibly after, but the larger point on where our evolution has happened is a very long history.

Where we started a few years ago, we had far fewer clients and a much simpler tech stack, and we could get by with a lot of manual work. The decisions we made at that point in time were very much that we could take a human-driven approach to deploying infrastructure: we'll put some of it in CloudFormation first, we'll have some checklists, we'll go through them, and we'll just get things running as quickly as possible. So it all started with Ansible playbooks that would run some CloudFormation that would spin up the pipelines. When we first started writing that automation, we made several big mistakes. Mistake number one was not making the infrastructure granular. Where we'd be deploying a VPC and an Elastic Beanstalk stack and maybe some Amazon S3 buckets and a Redshift cluster as well, we'd put all of that in one giant CloudFormation template. At the time, that made perfect sense, right? We had one version of what we were deploying, and we'd go and deploy it. Then we'd run into all these interesting issues where you'd have clients say, well, I don't want this part of it, I only want this part of it, and you go, okay, so now I've got a whole new version of my stack. So you fork that stack, and okay, you get this version, you get this version, and then you've got all those permutations. What we quickly realized is that what you need in infrastructure automation is not that big bang stack; you need very high granularity in all the components. In the same way that you mentioned Snowplow is very composable, built from lots of microservices, we needed to approach infrastructure in much the same way, or even more composably.

So where we are now in that journey is that we went from big bang Ansible playbooks with very large CloudFormation templates to a bespoke tooling system, which was still based on Ansible and CloudFormation but had a lot of that granularity starting to appear. That worked for quite a long time, but still wasn't very flexible. Part of that was probably our use of CloudFormation, which we found to be a little bit awkward to work with and awkward to make very flexible. But the key journey there was really about going from low granularity to high granularity. And then we ran into a further issue, which was about state. Up until we started using Terraform, all of our deployment tools had been completely stateless.
We'd leveraged the fact that we were using CloudFormation, so we could just query the outputs of CloudFormation templates, or we were writing API calls to go and check whether certain components had been deployed and what configuration they had at the time. So it was all just-in-time resolution, and that was really flexible and reduced any need for us to worry about state anywhere, but it also made us very lazy, in that we weren't caring about that, so we weren't necessarily making the right decisions. As well, every time we wanted to expand the system, we had to go and fetch all this information again. It also meant it was very hard to build a view of what had been deployed: it made it very difficult to write a tool that could just build a report and say, hey, this is everything that's deployed, this is the current state of the entire system, because it was all stateless. Doing that was very expensive, with long API calls and checks that were just not very useful. And that bespoke system, being what it was, was very hard to then turn into a nice API. It was also nearly impossible to train the rest of the team on it, which I quickly discovered as I started hiring more SREs and trying to train them on a tool that no one could actually use easily.

So that's when the next part of our journey began, which was saying, well, let's throw it all out and start again, which we actually started at the beginning of last year. We settled on Terraform because it was flexible enough to support multiple clouds, which we now do. We needed something that could have a common instruction language for GCP or AWS, and possibly in the future Azure or any other cloud that might appear. We also wanted something that a new engineer joining could deal with: they wouldn't have to learn a bespoke configuration language, they'd just have to learn what we built on top of it, which was a massive difference in terms of how well we could support this. It also had all of the heavy lifting done: it had state, and it had integrations with Consul and Vault, which we're leveraging quite heavily for deployments, centralized metadata management, and centralized secret management that we can then feed into all of the configuration that we've defined with Terraform. As well, as I mentioned previously, with our adoption of Nomad we've now been able to slap an API on the front of it all, which we call our deployment service, very aptly named, that can then go and use this whole ecosystem to manage all of this infrastructure. So we've come from a human-driven, choose-your-own-adventure style infrastructure management tool based on CloudFormation and Ansible to this world of API-driven, ChatOps-driven infrastructure management, which is where we've gotten to now. I guess the other quick thing I'd love to touch on there, which you mentioned, is that Kubernetes is kind of the flavor of the month for how everyone runs and manages containers.
On Google we did roll with Kubernetes. We did that because Google has a fully managed Kubernetes offering, which was very attractive to us, because then we didn't have to home-roll our own Kubernetes. That's a big thing we look for in our implementations: we need minimum overhead in every way, shape, and form. We can't have too much overhead, we can't do too many custom things; when we can use a cloud tool, we use the cloud tool, because that is very important for us in scaling out our operation. For GCP, Kubernetes seemed like a very good fit. For AWS, though, we're looking in a slightly different direction, and there are a few reasons for that. One is that Amazon's managed Kubernetes isn't quite the same breed as Google's managed Kubernetes: you're still responsible for setting up the underlying worker nodes. It's a bit like when Elastic Container Service first arrived and you still had to manage all of those EC2 servers yourself. It's not a fully managed service in that sense; you're still responsible for setting up your auto-scaling groups, setting up the servers, and hooking them up to the cluster. So there's that reason, but as well, what I found personally, and take this with a grain of salt, is that there is a lot going on in Kubernetes, and for what we're trying to do with it, it's not that it's overkill, but it does more than what we need it to do. By virtue of that fact, it costs more than maybe we want it to cost, in the sense that we don't need all of these advanced extra scheduling and management systems just to run a couple of pods. All they need to do is run a simple Docker container that scales up and down. There's no extra service discovery, there's no internal load balancing that needs to happen; that's all done for us already. So we're looking at ECS as just a very simple container management fabric, as opposed to Kubernetes, which, to its credit, is much more powerful, but it's just much more than what we tend to need for the Snowplow stack.
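Josh mentions putting an API, their deployment service, in front of the Terraform/Consul/Vault/Nomad ecosystem. As a loose illustration of how such a service might hand work off to Nomad (not Snowplow's actual deployment service), the sketch below registers a simple Docker-based job through Nomad's HTTP API; the Nomad address, ACL token handling, and job definition are assumptions.

```python
# Hedged sketch: register a job with a Nomad cluster over its HTTP API, the way
# an internal "deployment service" might queue up infrastructure work.
# Address, token, datacenter, and job shape are hypothetical.
import requests

NOMAD_ADDR = "https://nomad.internal.example.com:4646"  # assumed endpoint
NOMAD_TOKEN = "example-acl-token"                        # assumed ACL token


def register_job(client_id: str, image: str) -> dict:
    """Submit a simple Docker-based service job for one client's tooling."""
    job = {
        "Job": {
            "ID": f"deploy-runner-{client_id}",
            "Name": f"deploy-runner-{client_id}",
            "Type": "service",
            "Datacenters": ["dc1"],
            "TaskGroups": [{
                "Name": "runner",
                "Count": 1,
                "Tasks": [{
                    "Name": "terraform-runner",
                    "Driver": "docker",
                    "Config": {"image": image},
                    "Resources": {"CPU": 500, "MemoryMB": 512},
                }],
            }],
        }
    }
    resp = requests.post(
        f"{NOMAD_ADDR}/v1/jobs",
        json=job,
        headers={"X-Nomad-Token": NOMAD_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains the evaluation ID Nomad created


if __name__ == "__main__":
    print(register_job("acme-prod", "example.registry/terraform-runner:1.0"))
```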
Tobias Macey
0:34:04
And in terms of your experience of building out this automation and managing this platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Josh Beemster
0:34:16
That's a tough one. I think the most interesting has been, in the time working with clients, just how deeply interested a lot of clients are in how we're deploying things, and how hands-on all of these teams want to be. That's been an interesting and challenging lesson in terms of having to defend your work constantly, I guess is the point I'm trying to get to. If you were developing a system that was internal only, you wouldn't expect so much attention on it; you wouldn't expect people to rip it apart in quite so many ways. So that's been a challenge in developing the stack, but it's definitely a good thing to have. I think if you onboard a client and they audit you every time, you can only get better, and we've seen that as we've evolved. It's been challenging to try to meet all these requirements and expectations, but that's actually worked out for the better, because we've got a much better system at the end of it than we otherwise would have.

I think the other difficulty is really just figuring out how to manage so many servers concurrently, how to monitor all of them, and how to ensure the uptime of all of them. We run into a lot of really challenging scenarios with what we've deployed. One big issue is actually with anything that has an ephemeral nature, and this is one big reason, and I think there's another question coming about what things we'd want to change in the stack, but batch processes and ephemeral processes are much more fragile than you could possibly imagine. In a situation where you're just one company running one ETL process per day, you probably won't run into these issues, but we often see giant cluster failures across Amazon, which is massively challenging, and as we've scaled out we've started to be impacted by those a lot more. For example, we might have the EMR API fail across us-east-1 for a couple of hours. Now, again, if you're running one ETL job, you go, okay, I've got one failure, one thing that I've got to go and recover. On our side, we've got our support team, which might have 40 or 50 failures that they suddenly have to clear and communicate out to clients, making sure that they understand what's happening: why is their data late, why is it not arriving in the data warehouse? So that's definitely a big challenge. I'm not sure if it's an unexpected challenge, but it's definitely an interesting one to figure out how we deal with, along with any of the scale problems really, as a smaller SRE team trying to figure out how we manage the infrastructure of so many clients, keep it secure, and make sure that none of them are costing too much. So there are a lot of interesting challenges there that we're looking to solve in part with actual Snowplow monitoring.
So we build a lot of monitoring on top of all of these systems to try to do some trend analysis, and we're hoping to get a lot more time to look at solving these challenges with machine learning: figuring out how we can detect trends in the data and how we can scale up systems intelligently ahead of anything happening, so we can provide the best possible service. It's a lot of scale problems.
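To make the failure-triage problem concrete, here is a rough sketch of the kind of sweep a support team might run after a regional EMR incident: list the clusters that terminated with errors and the steps that failed, so every affected pipeline can be identified quickly. This is an illustrative boto3 example, not Snowplow's internal tooling; the lookback window and the assumption that one cluster maps to one client's ETL run are mine.

```python
# Hedged sketch: after an EMR outage, sweep the account for ETL clusters that
# terminated with errors and collect their failed steps for triage.
from datetime import datetime, timedelta

import boto3

emr = boto3.client("emr", region_name="us-east-1")


def failed_etl_runs(hours: int = 6) -> list:
    since = datetime.utcnow() - timedelta(hours=hours)
    failures = []
    paginator = emr.get_paginator("list_clusters")
    for page in paginator.paginate(
        CreatedAfter=since, ClusterStates=["TERMINATED_WITH_ERRORS"]
    ):
        for cluster in page["Clusters"]:
            steps = emr.list_steps(
                ClusterId=cluster["Id"], StepStates=["FAILED"]
            )["Steps"]
            failures.append({
                "cluster_id": cluster["Id"],
                "cluster_name": cluster["Name"],  # e.g. which pipeline/client
                "failed_steps": [s["Name"] for s in steps],
            })
    return failures


if __name__ == "__main__":
    for run in failed_etl_runs():
        print(run["cluster_name"], run["failed_steps"])
```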
Tobias Macey
0:37:59
And are there any elements of your experience of managing the Snowplow platform that are more broadly applicable to data infrastructure as a whole that you think are worth calling out?
Josh Beemster
0:38:12
I think the biggest one is probably just, as a general rule, that infrastructure has to be super composable. It has to be as granular as you can make it. If you want to be able to evolve it quickly, change things quickly, and attach lots of different pieces together, start very composable. Don't big-bang an infrastructure stack; that will catch you out very quickly. I think that's probably the biggest learning I've had, along with how you scope your resources: making sure you understand which resources are regional constructs and which are global constructs, how you group things, and how you manage your different infrastructure resources. In terms of just managing data pipelines, getting a clear understanding of what you're tracking, why you're tracking it, and how much data you're expecting or wanting to collect is incredibly important. Make sure that you design it for purpose. Going too generic with what you're building doesn't really help; really think about what the end goal is. For example, within our own internal pipelines we've got an ops or monitoring pipeline, which is really streamlined for dealing with lots and lots of small metrics, and we've got our business pipeline, which is more geared towards our longer-term analytics of our website and those kinds of things. So figuring out what structure you need to serve the business is also really important.
Tobias Macey
0:40:04
And if you were to start over today with all of Snowplow and the infrastructure automation that you're using for it, what are some of the things that you would do differently, or ways that you would change some of the evolution of either the Snowplow pipeline itself or the way that you've approached the infrastructure management?
Josh Beemster
0:40:22
So I won't speak to the overall Snowplow components, but on the infrastructure side of it, I think the biggest thing is that we'd probably not go with something like Kinesis, or we would have approached it very differently. I think we would have leveraged something that is truly auto-scaling. That's probably one of the biggest infrastructure problems we have at the moment: Kinesis and its lack of elasticity. It does cause us a lot of heartache. Beyond that, in terms of managing the infrastructure, where we are now is pretty close to where I wanted to end up. As I mentioned, we did rebuild the entire system last year, and that was with a lot of years of learning about how we could do things better and differently.

I think if I could go back to the start of last year and take a slightly different tack, I would go even more granular in terms of the stack separation and the topology of the infrastructure that we've built. We've still gone too big bang, which is making it quite hard to evolve certain parts of the infrastructure stack that we're trying to manage. We've coupled components together that should have been split apart, we've made it quite difficult for ourselves, and we're going to be paying for that in the near future: how do we unpick some of these associations, and how can we make it more flexible again? That's really a problem we suffer from just because we have so much variance in what our clients want. There isn't one pipeline; these pipelines have lots of different pluggable components. There are needs for maybe multi-region support, multi-cloud support for sinking data into a single pipeline, or really custom ETL systems, and the more granular we are, the easier all of that would have been. Not that we've coded ourselves into a hole, but we do have some work to make it that much more flexible again. That's really, I guess, a learning as we go: we just need to be super flexible, because every client is going to have different wants and different needs, and we want to be able to serve them all, but in a manageable way. And that is difficult if you've only got one version of what you can set up.
Tobias Macey
0:42:57
And what's in store for the future of the Snowplow product, and the way that you're approaching management of the service that you're providing for it?
Josh Beemster
0:43:07
So, hopefully coming soon, and I hope the product team don't get upset at me for saying any of these things. What I'm hoping for quite soon is that from our managed service UI you'll be able to start managing infrastructure directly. At the moment, the way customers interact with it is that they have a lot of transparency when they log into the sub-account or into the GCP project and go and look at things, but they don't have, I guess, a lot of visibility over what is configurable and what all the different options for their pipeline are; that's still somewhat hidden away. So hopefully coming soon we'll have a lot of that surfaced into the UI, almost like ETL-builder style tooling. If you are a company looking to get a robust, highly available data pipeline, you'd be able to click and drag components: set up a Snowplow collector in GCP that streams data into a Kafka cluster in AWS, and then set up a second collector in AWS, so you have that high availability not just across regions or across availability zones, but across clouds. That's really where I see the infrastructure going, and where it will need to go. For really robust, highly available setups, it's not that banking on one cloud is necessarily risky, but there's definitely a requirement, and there will be more requirements, to have things much more split and much more spread, so that you can really have that hundred percent uptime, taking all kinds of worries away: if Amazon has a glitch, or GCP has a glitch, that's not going to suddenly stop your pipeline from working, it will still keep carrying on. It's just all that extra failover.

The other big change, and I sort of touched on this when we were talking about batch processes and ephemeral EMR processes, is that a lot of that, if not all of it, will be moving to streaming architectures quite soon. Where currently there are still some batch processes left, Snowplow will be moving to essentially 100% streaming, which we're very hopeful will result in not only better cost efficiency, especially at scale, but much higher robustness and stability of the platform in general. There are a lot of difficulties with batch that we struggle to overcome. Working out what my cluster specification should be between spikes is quite difficult; you can't auto-scale a batch process in the same way that you can auto-scale a streaming process. You end up dealing with the overhead of, I've had a spike in traffic and I just need to grab a few extra servers, or, I've had a spike in traffic for a really long time and I now need an enormous cluster to get through this backlog in a reasonable amount of time. Those kinds of problems we're hoping will just go away with the migration to a fully streaming architecture. It is, on the flip side, definitely more expensive to run at the lower scale, but I guess we're not really building for the low scale: it is a big data architecture, these are big data pipelines, and streaming definitely has a lot of cost and performance efficiencies at scale. And that's, I think, where we want to take Snowplow next.
Tobias Macey
0:46:41
And are there any aspects of the Snowplow product, or your management of the infrastructure and the services you provide for it, that we didn't discuss yet that you'd like to cover before we close out the show?
Josh Beemster
0:46:51
I think that's probably everything.
Tobias Macey
0:46:55
All right. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Josh Beemster
0:47:12
I can't see a big gap, actually, in the tooling. There's so much amazing tooling now for managing this stuff. Maybe the biggest gap is some way to harmonize a lot of the tools that are available, which a lot of cloud providers are starting to do. But there are still so many options in managing data that it makes it hard to know which way you should go. Should you put everything in S3 as Parquet? Should you put everything in Redshift? Should you put everything in Snowflake DB? Should you put everything kind of everywhere for different business use cases? I think there's probably no silver bullet, nothing that solves all of these problems, and there might never be, but there is just such a wealth of options that it's quite hard to pick any one. I don't know if that's a good gap, or if that's just a factor of too much choice.
Tobias Macey
0:48:10
It's definitely something that continues to be a problem, the paradox of choice, particularly as we add new platforms and new capabilities to the overall landscape of data management.
Josh Beemster
0:48:22
But it's a lot better than it used to be, that's for sure.
Tobias Macey
0:48:28
Certainly. All right, well, thank you very much for taking the time today to join me and discuss the work that you're doing at Snowplow, and how you're managing the infrastructure and automation around that for all of your different customers and the private SaaS nature of your business. It's definitely an interesting area, and something that doesn't really get a lot of attention as far as how to manage the underlying infrastructure for these data products. So thank you for all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks, guys. Cheers. Bye.
0:49:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.