Building A Data Lake For The Database Administrator At Upsolver

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

02 June 2020

Building A Data Lake For The Database Administrator At Upsolver - E135

Summary

Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce significant complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. To bring the DBA into the new era of data management, the team at Upsolver added a SQL interface to their data lake platform. In this episode, Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have deliberately grown their platform to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
What are the pieces of advice that you wish you had received early in your data engineering career? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey, and today I’m interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of what a data lake is and what it is composed of?
We last talked in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?
- How has Upsolver changed or evolved since we last spoke?
  - How has the evolution of the underlying technologies impacted your implementation and overall product strategy?
What are some of the common challenges that accompany a data lake implementation?
How do those challenges influence the adoption or viability of a data lake?
How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?
- What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?
What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
How is the SQL layer in Upsolver implemented?
- What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?
What are the main concepts that you need to educate your customers on?
What are some of the pitfalls that users should be aware of?
What features of your platform are often overlooked or underutilized which you think should be more widely adopted?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?
What do you have planned for the future?