How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
November 11th, 2018
51 mins 50 secs
About this Episode
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, NodeBalancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Upsolver is and how it got started?
- What are your goals for the platform?
- There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
- What are the shortcomings of a data lake architecture?
- How is Upsolver architected?
- How has that architecture changed over time?
- How do you manage schema validation for incoming data?
- What would you do differently if you were to start over today?
- What are the biggest challenges at each of the major stages of the data lake?
- What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
- When is Upsolver the wrong choice for an organization considering implementation of a data platform?
- Is there a particular scale or level of data maturity at which an organization would be better served by moving management of its data lake in-house?
- What features or improvements do you have planned for the future of Upsolver?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Upsolver
- Data Lake
- Israeli Army
- Data Warehouse
- Data Engineering Podcast Episode About Data Curation
- Three Vs
- Kafka
- Spark
- Presto
- Drill
- Spot Instances
- Object Storage
- Cassandra
- Redis
- Latency
- Avro
- Parquet
- ORC
- Data Engineering Podcast Episode About Data Serialization Formats
- SSTables
- Run-Length Encoding
- CSV (Comma Separated Values)
- Protocol Buffers
- Kinesis
- ETL
- DevOps
- Prometheus
- CloudWatch
- Datadog
- InfluxDB
- SQL
- Pandas
- Confluent
- KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA