Automating Your Production Dataflows On Spark - Episode 105

Summary

As data engineers, the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production-grade, scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation for understanding all of the incidental engineering that is necessary to make your data reliable.

What happens when your expanding log & event data threatens to topple your Elasticsearch strategy? Whether you’re running your own ELK Stack or leveraging an Elasticsearch-based service, unexpected costs and data retention limits quickly mount. Now try CHAOSSEARCH. Run your entire logging infrastructure on your AWS S3. Never move your data. Fully managed service. Half the cost of Elasticsearch. Check out this short video overview of CHAOSSEARCH today!


Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct the underlying infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.


Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com today to find out more.
  • Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what the Ascend platform is?
    • What was your inspiration for creating it and what keeps you motivated?
  • What were your criteria for determining the best execution substrate for the Ascend platform?
    • Can you describe any limitations that are imposed by your selection of Spark as the processing engine?
    • If you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?
  • Can you describe the technical implementation of Ascend?
    • How has the system design evolved since you first began working on it?
    • What are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?
  • How does the programming interface for Ascend differ from that of a vanilla Spark deployment?
    • What are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?
  • How do you enforce the lack of side effects in the transforms that comprise the dataflow?
  • Can you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?
  • What are some of the most challenging aspects of building and launching Ascend that you have dealt with?
    • What are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?
  • What are some of the capabilities that you are most proud of and which have gained the greatest adoption?
  • What are some of the sharp edges that remain in the platform?
    • When is Ascend the wrong choice?
  • What do you have planned for the future of Ascend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
