Building A Data Lake For The Database Administrator At Upsolver - Episode 135

Summary

Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Eini describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.

Machine learning is finding its way into every aspect of software engineering, making understanding it critical to future success. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype.

Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
  • Your host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Eini about building a data lake for the DBA at Upsolver

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of what a data lake is and what it is comprised of?
  • We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?
    • How has Upsolver changed or evolved since we last spoke?
      • How has the evolution of the underlying technologies impacted your implementation and overall product strategy?
  • What are some of the common challenges that accompany a data lake implementation?
  • How do those challenges influence the adoption or viability of a data lake?
  • How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?
    • What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?
  • What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
  • How is the SQL layer in Upsolver implemented?
    • What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?
  • What are the main concepts that you need to educate your customers on?
  • What are some of the pitfalls that users should be aware of?
  • What features of your platform are often overlooked or underutilized which you think should be more widely adopted?
  • What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?
  • What do you have planned for the future?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What advice do you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 Gbit private networking, scalable shared block storage, a 40 Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you get everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a machine learning expert who provides unlimited one-to-one mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there's no obligation. Go to dataengineeringpodcast.com/springboard today and apply. Make sure to use the code AISPRINGBOARD when you enroll. Your host is Tobias Macey, and today I'm interviewing Ori Rafael and Yoni Eini about building a data lake for the DBA at Upsolver. So Ori, can you start by introducing yourself?
Ori Rafael
0:02:29
Sure. So I'm the CEO, one of the co-founders of Upsolver, and I'm coming from a DBA and big data integration background.
Tobias Macey
0:02:38
And Yoni, how about you?
Yoni Eini
0:02:39
Hey, I'm Yoni. I'm the CTO and the other co-founder of Upsolver. Most of my experience before Upsolver is around data science, big data preparation, streaming data, all sorts of stuff like that. And then, of course, at Upsolver we're building a high-scale data lake platform.
Tobias Macey
0:02:57
And Ori, how did you first get involved in the area of data management?
Ori Rafael
0:03:00
I was working on a data lake with Yoni. We were trying to solve an advertising optimization problem over the data lake, and we found ourselves spending a lot of time, a lot of data engineering time, just to be able to query and work with the data. I was thinking, I think I was a good DBA, but I didn't really have the ability to go and work with the data lake directly. So that's when we started working with data management, and that's when we
Tobias Macey
0:03:28
kind of talked about the idea. And Yoni, how about you? How did you first get involved in data management?
Yoni Eini
0:03:33
So I think for me, it's
0:03:34
always been something that has been top of mind. My first job was in the IDF, and I was a DBA. Then after that I was a data scientist, a researcher, an application developer, and finally CTO there. And throughout everything, everything was data, everything was large volume. Well, small volumes from today's perspective, but back then it was large volumes, with streaming data and a lot of tricky decisions on how to manage it and where to put it. So it's always been something that was super interesting for me. And then I think that at Upsolver we very easily gravitated towards taking on this hard problem of, how should this data actually be managed? What are the best practices? Do people really need to worry about it at all, or should it just be done? Is there a right way to do it?
Tobias Macey
0:04:27
So Yoni, you were actually on the show almost two years ago now, in November of 2018, when we talked a bit about what you're building at Upsolver: this platform for managing data lakes in the cloud. Before we get too far along in the conversation, can you each give your definition of what a data lake is and what it's comprised of?
Ori Rafael
0:04:48
Sure. So I think the easiest way to think of a data lake is that it's a decoupled database. I'm taking the storage part, the metadata part, and the compute part, and starting to manage each one separately, which gives me a lot of advantages when it comes to elasticity, cost management, and scaling. On the other side, it creates a lot of complexity. That would be my definition.
Yoni Eini
0:05:12
Yeah, I think for me, in the end, the data lake is a cost thing. As data volumes grow, databases are great, they're really good at what they do, but it just becomes too expensive. And the cost also determines how much data you can deal with. If I have a database that costs me $1,000 per terabyte, then at 50 terabytes I'm thinking to myself, well, maybe I don't want to store this data anymore. And then along comes the data lake, and it really opens up your capability to deal with a lot more data, just because it's so cheap. So really, in the end, the bottom line makes the whole difference. I'd say that for me a data lake is the natural progression towards larger and larger datasets, and I think it's very important to think about it that way: the data lake shouldn't be a compromise. Maybe today, for a lot of people, it is a compromise, and they would rather have the data in a database, but it should be as powerful and as useful as a database. Since I see that as the case, from my perspective it's just the natural progression that things go there.
Tobias Macey
0:06:20
Some of the initial ways that the data lake started to manifest were with the Hadoop ecosystem, the MapReduce project, and the HDFS file system for being able to spread your data across multiple machines. Nowadays, a lot of people are using object storage for the actual storage mechanism. And even in the past two years, there's been a lot of movement in the availability of different technologies for working with data lakes, and different managed platforms such as yours for being able to build out a data lake with a single layer. I'm wondering how the overall landscape of data lake technologies and the overall adoption of data lakes as a solution for businesses has changed since the last time that we spoke.
Yoni Eini
0:07:04
I can jump in and say that, from my perspective, I think the tools have become a lot more powerful. Two years ago there were a few compromises that you really had to make, because in the end, if you're talking about a data lake, it's not an SSD: you have access latency, and of course there are all the eventual consistency issues of your storage and things like that. And I think that over the past two years these issues have pretty much been resolved. Today, from object storage you get performance that's equivalent to SSDs. The cost hasn't gone down much in that time, but just the fact that the performance is so much better today means that, I think, pretty much any use case you can think of, you could solve using a data lake given the correct data modeling. Maybe the other
Ori Rafael
0:07:50
thing, Tobias, is the popularity. Two years ago I was still explaining data lakes in many cases, and today everyone has a data lake agenda. If you look at the top companies in the market today, look at the Spark adoption: everyone is doing Spark, much more compared to two years ago. And if you look at the big data warehouses, they're starting to call themselves, you know, we are a data warehouse, we are a data lake, we are a data platform, and they're trying to add capabilities to query both their traditional model and the data lake directly. You see it across all the big vendors. You have Redshift with Redshift Spectrum giving you the data lake capabilities, you have the launch of Azure Synapse Analytics with serverless queries over blob storage, and BigQuery has BigQuery external tables in beta. Snowflake is calling itself the data platform today, and not the cloud data warehouse like it did in the past. So I think that's a big change, where everyone wants to succeed with the data lake use cases with different types of solutions.
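As a concrete illustration of the pattern Ori describes, where a warehouse queries the lake in place, here is a minimal sketch of a Redshift Spectrum external table over Parquet files in S3. The schema, IAM role, bucket path, and table names are hypothetical.

    -- Hypothetical example: expose Parquet files in S3 to Redshift via Spectrum.
    -- The external schema points at a Glue Data Catalog database; all names are illustrative.
    CREATE EXTERNAL SCHEMA spectrum_demo
    FROM DATA CATALOG
    DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    CREATE EXTERNAL TABLE spectrum_demo.page_views (
        user_id  VARCHAR(64),
        url      VARCHAR(2048),
        event_ts TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://my-lake-bucket/page_views/';

    -- Queried like any other table, but the data never leaves the lake.
    SELECT COUNT(*) FROM spectrum_demo.page_views WHERE event_ts > '2020-01-01';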
Tobias Macey
0:08:56
And how do the cloud data warehouses such as Snowflake or BigQuery differ from a full-fledged data lake in terms of what they're able to offer, and maybe some of the cost issues or performance capabilities, compared to using the native data lake technologies, whether that's things like Presto and Spark or a managed platform?
Yoni Eini
0:09:20
I'd say that there are two major advantages that you have with a data lake. One is the cost, and I think that both BigQuery and Snowflake are addressing the cost. Of course there's a cost for the platform itself, but beyond that, the cost per storage is going to be similar to a data lake. And by the way, you see that across a lot of vendors. Kafka now has, or I'm not sure if it's out yet or they're releasing it as we speak, the S3 extension to the Kafka stream, or you have Redshift using Redshift Spectrum. They're all adding these extensions to the data lake and enjoying the cost. But specifically when you talk about Snowflake and BigQuery, the data is sitting inside these systems: once it goes in, it doesn't come out. And I think that really breaks the second big advantage of the data lake, which is that you can put it before other systems. It goes in the middle of your architecture: all your data comes in, it goes into the data lake, and then you can do whatever you want with it and send it to whoever wants to consume it. With BigQuery and Snowflake, you pretty much need to consume it within BigQuery and Snowflake, and I think that really limits the flexibility of what you can do. And there are a lot of use cases where, say, you want to use a key-value store, and there's no way to pull the data out of Snowflake and put it into a key-value store in a way that's really going to be effective.
Ori Rafael
0:10:46
Yeah, exactly. Speaking as a DBA, in the old days as a DBA we solved everything with Oracle, because our data was in Oracle. Today it's called BigQuery or Snowflake, but if the data is there in a proprietary format, I'm going to go to great lengths to solve the problem there and not replicate the data to an additional database. A data lake gives me that ubiquitous access option that I don't have with databases that come with a proprietary format.
Tobias Macey
0:11:15
And so in that same time span, roughly the past two years, in addition to the changes in data lake technologies and their overall adoption, what are the ways that the Upsolver platform has changed or evolved since we spoke, and how has the evolution of those underlying technologies impacted your strategy for implementation and the features that you decided to include?
Yoni Eini
0:11:39
So I'd say that, I mean, last time we spoke, in 2018, that was our first year of general availability. So of course the platform has changed a lot as far as just maturity and scale, how much data we can deal with and how much data we are dealing with. Today we have customers that are putting two gigabytes per second through the system, which is a far cry from where we were then, and I'm sure it can scale much beyond that. But I think the main difference between the platform then and the platform today is that today we have SQL as a definition language on top of the UI-based architecture. It used to be just UI: you could define everything using the user interface, and you had a very broad set of capabilities. But today you have full SQL on top of your data, and I really think that's a game changer, because in the end, the UI is very nice, and it democratizes in a way that even SQL can't necessarily, because not everyone knows SQL, but in the end SQL is the language of data. No matter what ETL process I'm doing, in the end I'm most likely going to be querying the results using SQL. So I think it's a huge difference to have one language for the entire pipeline. And then the second question, about the underlying technologies: S3, and I'm not sure how many people are aware of this, because S3 was always really powerful and I'm not sure how many people are actually pushing its limits, but a few years ago they had a best practice that you should use prefixes in your buckets that are hashes, which completely didn't fit with anything else that S3 does; it was a performance optimization they were telling you to do. And about a year and a half ago they released a new article saying, well, actually, you don't need to do that anymore. And by the way, remember that we told you it's good for 100 requests per second? Now it's good for 5,500 requests per second, and you can scale that out between different prefixes in the bucket, and between buckets, so you can multiply that by 10 if you want. So I think the difference between 100 requests per second and 5,500 or 55,000 requests per second to your S3, your blob storage layer, basically means that you can do anything. You really don't need anything aside from this simple, cheap storage, which wasn't necessarily the case two years ago. I think that's a very big difference.
Tobias Macey
0:14:04
And so in terms of the adoption and implementation of data lakes, what are some of the common challenges that accompany that, whether people are using a managed platform or a self-built system using a lot of open source technologies? What are some of the difficulties or complexities that arise regardless of the actual technologies that you're building on top of?
Yoni Eini
0:14:30
I think, if I'm paraphrasing: given the fact that I'm using a data lake, what gotchas are there? How is that making my life miserable? So I think that in general, in a data lake, you really need to worry, or at least traditionally you'd really need to worry, about the low-level stuff. It's not like a database; sure, you have a DBA, and you have to build your indices and things like that, but in the end you're not worrying about how the database is storing the files, or whether there's replication going on, or what's happening behind the scenes to make sure that it's load balancing. All of that is handled for you by the database itself. With a data lake, it's even worse than that: you don't only need to worry about the load balancing and where and how you're going to store it. I was just saying how powerful it is, but it's actually very weak in the sense that you don't have a lot of capabilities for discovering data; you pretty much need to know where it is. So you have this file system, you have to figure out your folder structure, you have to figure out your file formats, you have to figure out your compression, all this stuff that a database would normally do for you, you need to do on your own. And all the triggering, so process management and orchestration, and if you want state, where to keep the state. In the end, it's a mess. So it comes out that something like 90% of the time you're working on making sure that your data lake is performing as a database should, or performing as the storage system of a good system would, just putting it in place and preparing it and massaging it to make sure it's good, and then 5% of your time is actually doing stuff with the data, actually consuming it. And often it's going to be different people doing the two, so the people who are consuming the data lake are just going to be waiting for people to implement stuff for them. On-prem it's even worse, because then you even have to manage the storage itself; HDFS isn't the easiest thing in the world to manage. But even in the cloud, where you have S3 and that's super great, you still need to do a lot of work to make sure that everything works as it should. It doesn't really add up in the end. You hear a lot about data swamps: you wrote data into your lake, and then it became a swamp because you can't access it. You lost the data, basically. And it's even worse, because you're paying for it and you can't even really delete it, because you don't know where it is. So I think that would be the common challenge: you actually have to build the data lake, rather than just consuming a prebuilt product which is a data lake.
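To make the point about hand-rolled folder structures, file formats, and partitions concrete, here is a minimal sketch of the kind of layout and table definition a team typically maintains themselves on a data lake. It uses generic Hive/Athena-style syntax rather than anything Upsolver-specific, and the bucket, paths, and columns are hypothetical.

    -- Hypothetical hand-managed S3 layout, e.g.:
    --   s3://my-lake-bucket/events/event_date=2020-06-01/part-00000.snappy.parquet
    CREATE EXTERNAL TABLE events (
        user_id    STRING,
        event_type STRING,
        payload    STRING
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-lake-bucket/events/';

    -- New partitions must be registered (here, or via a crawler) before they are queryable.
    MSCK REPAIR TABLE events;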
Tobias Macey
0:17:02
So Upsolver, I know, aims to solve a lot of those challenges by ensuring that there is some appropriate modeling going on and that the data, as it's being written into the lake, is optimized for access and scale. And you mentioned too that with a database you generally don't need to worry too much about what's going on under the covers, because it handles a lot of it for you. But for people who are using a data lake, particularly if they're using something like Upsolver, what are some of those underlying realities of the data systems that power the lake and power your platform that still need to be understood by the operators to ensure that they don't accidentally shoot themselves in the foot?
Ori Rafael
0:17:40
In Upsolver, users pretty much only think about how they're going to transform the data. They're pretty much mapping their whole data into tables, so we're not giving them any control over orchestration or exactly how the data is stored. We give some configuration control, but the idea is that our approach is: don't let the users trip. So you're kind of asking where they could still trip, and I think it's more the concept of how you're going to eventually organize the data. As a DBA, I'm always used to creating relational models: I create the relational model, and then the users answer their questions with views on top of that model. But if you go to a data lake, you can't create a relational model, because the data lake is not indexed and all of those joins will not work well. So you should pretty much take all the data and map it directly into tables that are not necessarily relational. Let me try to illustrate that with an example. Let's say I'm working on an advertising problem, and I have a table of impressions and a table of clicks. I could just go and create a table for each one and then try to join them, and it will just not work. The way I would approach it in a data lake is that I would create a table that includes both the impressions and the clicks and let the user query that. Although it costs more storage, storage is cheap, and it will be much more beneficial from a query perspective.
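A rough sketch of the denormalized approach Ori describes, in generic SQL with hypothetical column names rather than Upsolver's own syntax: instead of two normalized tables joined at query time, impressions and clicks land in one wide table keyed by impression.

    -- Hypothetical denormalized event table: one row per impression,
    -- with click information folded in rather than joined at query time.
    CREATE TABLE ad_events (
        impression_id   VARCHAR(64),
        campaign_id     VARCHAR(64),
        impression_time TIMESTAMP,
        clicked         BOOLEAN,
        click_time      TIMESTAMP   -- NULL when the impression was never clicked
    );

    -- Click-through rate per campaign, with no join required.
    SELECT campaign_id,
           AVG(CASE WHEN clicked THEN 1.0 ELSE 0.0 END) AS ctr
    FROM ad_events
    GROUP BY campaign_id;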
Yoni Eini
0:19:06
Yeah, when I think about databases, in the end there are two things in a database that I need to worry about. One is the data model: making it consistent with itself, making it correct, making it useful. And the second is that I don't run out of space. That's pretty much the two things I care about: I can't run out of space, and I have to make sure that the data is well modeled. In a data lake you don't have that space constraint, you don't need to worry about running out of space, so you can solve problems in a much easier way, I would say. And the denormalization is one of those things, I'd say a main one; in the data lake you don't need to worry about that. It's actually simpler; it's closer to how people actually think about stuff than a traditional database. And with this SQL layer that you introduced, you mentioned that there was always a visual workflow for being able to work with the information in the lake, but how does the introduction of that SQL capability change the accessibility and the overall staffing options for people to manage the data lake?
Ori Rafael
0:20:09
Eventually, 20 times more users. If you try to do the research, we went to LinkedIn looking at how many people know SQL versus how many people know data engineering or Hadoop and those kinds of technologies, and you see that the number of data engineers in the world is something like 2% to 5% of all data practitioners. Also take a look at the growth in the number of data lakes: if I look at on-prem Hadoop and combine all the customers from Cloudera and MapR and Hortonworks together, you get to less than 10,000 paying customers for a Hadoop distribution, while Amazon S3 has over a million customers. So you have exponential growth in data lakes, but you definitely don't have exponential growth in the number of data engineers. So I think maybe the biggest, most important thing in Upsolver is the ability to reach 20 times more users who get direct access to the lake, and don't need to drive the entire process through data engineering.
Yoni Eini
0:21:12
And I think when you're talking about SQL, it's very easy, and I don't want to badmouth anyone, but a lot of systems would say, okay, we support SQL, and in reality they kind of support, okay, take your SELECT, FROM, WHERE, and be thankful for the five built-in functions that we give you. And okay, sure, that kind of is supporting SQL; it means that you're still defining your transformations in SQL, but it's not really that useful. In the end, it's very much limiting your capabilities in order to fit into the language, or rather, maybe the capabilities were limited to begin with and the language is just reflecting that. I think there's a big challenge in making sure that you're actually giving fully fledged SQL, actually having all the capabilities that someone would expect from SQL, which is not trivial when you're talking about streaming data especially, and also on top of the data lake, and then also making sure that people understand that, given that they're defining their ETLs in SQL, they're not losing any capabilities. It's not that if I were to write things in code I would be able to do other stuff, or better stuff, or do what I want, but I can't do it in SQL. You kind of need to re-educate people. In databases they were already convinced; people switched there because they understood that they could do all their data work in SQL. Now you need to convince them that they don't need to write code in Spark, they can do it in SQL, and it will actually do everything they want.

And so in terms of the actual ways that you have implemented SQL, I know that you're using the ANSI standard, but what are some of the useful extensions that you have incorporated that are relevant to the data lake context and simplify the work of your users?

So I'd say there are three very powerful extensions. The first one I kind of wish had been part of ANSI to begin with. SQL is super nice for data modeling and everything, but it's very hard to build functions on top of one another. So for example, I want to concatenate two strings; that's fine. Then let's say I want to concatenate two strings and do something else on top of that. You can also do that, but then you can't reuse it anywhere else in your statement. So often you're going to have a function in your SELECT statement, and then the exact same function in your GROUP BY or in your WHERE clause, just because there isn't any composability. So one extension that we added, you can think of it as a procedural SET statement: SET field name equals whatever transformation. That single small addition to SQL allows you to define complex transformations and then consume them downstream, in additional parts of your SQL statement. SQL's answer to that is using subqueries: you define a bunch of transformations in your innermost query and then use another subquery to define additional transformations on top of that, and it becomes horrendous; it's very hard to read that kind of statement. So I think using the SET statements really simplifies things, especially in SQL for ETL specifically, where you'd have 10 or 15 different data transformations; it would really become a nightmare without that.

So I think that's one very big difference as far as the capabilities of the language. The second thing is that SQL was traditionally made for relational databases, which are flat: your tables are flat. Eventually they added JSON support and nested column support and things like that, but it's very clunky and nobody knows about it; nobody really knows how to interact with nested data using SQL. So we added a few language extensions around dealing with nested data within the original structure. Conceptually, and maybe it's a bit hard without drawing it, let's say you have a purchase, which is a JSON, and that's how raw data comes in; 90% of the raw data we see in the world is JSON. So you have your purchase, which is the root of the object, and then you have an array of items, and each item might have, say, a quantity and a price per item. Now I want to reason about these things, and usually all your transformations are happening either at the item level or at the purchase level. So I might want to multiply the quantity of the item by the price of the item, and obviously I want to multiply within the same item; I want to scope that calculation to each individual item. I don't want to, as would by the way happen in naive SQL, take all of the prices, do a Cartesian product with all of the quantities, multiply everything together, and get an array of n-squared size. That doesn't make any sense. So we have a language extension, which is very subtle, that simply allows you to access fields within arrays directly. When you do that in Upsolver, it makes sure that all the transformations you do are scoped correctly: if I'm multiplying two values and they're scoped together in a field, it will multiply within that field and not outside of it. That really makes a very big difference, because again, among people working with SQL there are the SQL superstars and there are the people for whom SQL is a second language, and you don't want to force the people with SQL as a second language to just never deal with their nested data. And I think the UI also helps there, because it exposes the syntax in a friendly way: when you add it from the UI, you can see what the SQL generated by it looks like. And then the third thing, and this is getting to be a pretty long answer, but the third thing I would say is that SQL generally deals with static data. You have a table, and the table is finished. That's not entirely true, since in SQL databases new data is coming in all the time, but in essence, when you run a query, there isn't really a built-in temporal aspect. Whereas when you're dealing with data lakes and streaming data, there has to be a built-in streaming aspect, because I have to deal with everything incrementally.
So when an event arrives, when I'm joining, I need to reason about when that event arrived and what portion of the lake I'm joining into. So we added a few language extensions around being able to bring in, for example, the latest data, or waiting for a few minutes for the data from the other stream to catch up, things like that. There are a few additional keywords that we added which allow you to seamlessly deal with streaming data without needing to build huge sub-selects that implement those constructs.
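A loose sketch of the style of pipeline Yoni describes. The syntax below is illustrative only, not verbatim Upsolver SQL; the SET keyword usage, the array notation, and all field names are paraphrased from the conversation rather than taken from the product's documentation.

    -- Illustrative only: a SET-style named transformation that can be reused
    -- later in the statement, plus field access inside a nested array that is
    -- scoped per array element instead of producing a Cartesian product.
    SET item_total = data.items[].quantity * data.items[].price;

    SELECT data.purchase_id  AS purchase_id,
           data.items[].name AS item_name,
           item_total        AS item_total   -- reuses the SET expression, no repeated subquery
    FROM purchases_stream
    WHERE data.country = 'US';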
Tobias Macey
0:27:56
And yeah, there's definitely been a trend that I've been seeing in the overall data space of this move towards streaming SQL. A lot of that is implemented on top of things like Kafka Streams with ksqlDB, or there's the Eventador platform for handling streaming data using SQL, and there's also the Materialize platform, and a lot of open source implementations. So it seems like there's this sort of de facto specification of how streaming SQL works, at least conceptually; I'm sure that there are variants in terms of the specifics of the syntax. And I'm wondering what your thoughts are on the overall space of streaming SQL, what you see as the future of it in terms of incorporating it into the broader specification of SQL, and its necessity given the current ways that data is being used.
Yoni Eini
0:28:48
Yeah, so I think, first of all, you have to keep in mind that the SQL specification is, how do I put this nicely, maybe a bit of a legacy thing. I know it's maintained, you have SQL:2016, and you have a lot of versions of SQL, and they come out with a lot of new stuff, but really, people are stuck in 1999. Okay, common table expressions kind of became standard, people know about them more or less, but a lot of the new features that you have in the SQL standard are actually not very standard; none of the databases actually support them. And even among the old features, you have a lot of stuff in there that is completely irrelevant now, like all sorts of Fortran support as part of the standard. It's not that relevant. So I'm not sure how important it is, because anyway, up until 10 years ago, when you weren't talking about streaming SQL, just SQL, there still wasn't actually a standard in practice; the standardization was only skin deep. I think that today it's kind of the same: everyone has their own flavor, they all have their own extensions and their own additional keywords and ways of reasoning about things. I'm not sure that's the end of the world, as long as people stick to the basics and make sure that at least, when someone writes a SQL statement using the common base of knowledge of people writing SQL, it's going to do what they expect and not something weird. And then, if you add a keyword here and there, I think it's less important that it actually be standardized, just because it never was. Even if you add it to the standard, let's say if ksqlDB does really well, or if Upsolver does really well, and all these changes I just mentioned are added into the SQL standard, is ksqlDB going to change? Probably not; I don't think they care that much. So I'd say the standardization is less important, but keeping things as concise as possible and as close to the base as possible is maybe the most important thing. And then, when talking about streaming SQL,
Tobias Macey
0:30:50
so I mean, definitely this is a huge thing. It's a huge departure, because it's not traditionally part of the language, and it's a very big conceptual shift for data consumers. I think maybe that's one of the challenges that we see: people need to start reasoning about streaming data where they're used to thinking about static datasets. That's a big difference between Upsolver and Spark, for example. Spark talks static, and then you can do streaming on top if you want, but the language is static, whereas in Upsolver the language is streaming. But then you need to educate around that; people need to realize that today data is streaming, that's what it is. There's also the difference in perspective between the initial, broadly accepted view of the Lambda architecture, where you have the separation of batch versus streaming, and you process streaming in real time as it comes in but then periodically go and backfill using your batch data to make sure that you have a high level of accuracy, and the proposed Kappa architecture, which is more focused around streaming and being able to maintain aggregates based on that streaming data. And I'm wondering what your thoughts are on the overall ideas of Lambda versus Kappa, or some of the different ways that you can architect to account for streaming data while still being able to get more detailed analysis based on the data that has actually landed in your lake, rather than trying to maintain these windowing functions where you have an imperfect view of the entire history of your information.
Yoni Eini
0:32:26
Yeah, so I'd say, first of all, the problem with the Lambda architecture is that you need to write stuff twice, and you really don't want to be writing stuff twice. Lambda came out of the world of batch, when they wanted to add a streaming layer. So you have, okay, I already have my batch, now add an additional streaming layer and figure that out in a separate language or something, and so I have my Lambda architecture. And then I think Kappa, maybe it's a bit naive, but conceptually it's saying, well, let's discard the past and say all the data is streaming anyway, kind of like what I've been saying now. So actually, why do you even need the batch layer? Just use the streaming layer, it'll do the batch, and everything will be fine. Of course you have performance issues, and you have to make sure you're not losing capabilities. But definitely, you have to have one language; you can't have two definitions of the same ETL. It has to be defined once, and then your infrastructure needs to convert it to either the streaming layer, if you have such a separation, or the batch layer. From Upsolver's perspective, we're a data lake platform, so of course we can deal with huge amounts of historical data and batch data and all that, but we treat it as streaming data. So in a sense I'd say that's similar to the Kappa architecture, in the sense that we only have one way of dealing with data, which is streaming, but we do it in such a way that you can deal with a huge, effectively infinite amount of data. So it's slightly different from, let's say, Flink, and the main proponent of Kappa I would say is Flink, where the way you do it is, you have a stream of data and you just run that stream from the beginning fast, and maybe separate the stream into a lot of different streams. But that requires a lot of preparation in advance. In Upsolver, we build the data lake in such a way that you can run your Kappa on top of the entire data lake very quickly, by splitting it up into time chunks and things like that, but you're still streaming over everything.
Ori Rafael
0:34:21
And Tobias, one thing that you mentioned is the sliding time window and the limitation there. I think that's one big limitation that Upsolver addressed. The way streaming systems are built, you can only address and work with data that you can currently fit in memory in that time window. So we've built an indexing system that goes along with your stream, so when you want to combine your real-time data, you can combine it with all your historical data and not just the data that you can fit in the window. Our objective here is to do everything with a stream and implement Kappa, and not do Lambda.
Tobias Macey
0:34:59
And one other thing that I want to call out from your earlier answer, Yoni, is the handling of nested data. As you said, that is one of the consistent challenges, particularly in data lake systems: you don't want to have to pre-process the information a lot before it comes into your lake, and so you do often end up with these nested JSON structures or other formats that have nested fields. Being able to access those in an intelligent and natural way is something that's a shortcoming of a lot of the platforms that I've tried to use at various times. In the data warehouse approach, you would generally handle that flattening of nested structures as part of your transformation before loading it into the database. So recognizing that, being able to handle those fields is definitely a benefit to people who are trying to work with their data without having to do a lot of upfront work ahead of time, and without potentially losing information or context by flattening things without introducing the appropriate information as to where those flattened fields originally existed.
Yoni Eini
0:36:07
I mean, definitely the data lake needs to represent the original data, the data that was generated. If you're doing data transformations before dumping it into the data lake, you're doing it wrong; I probably shouldn't make such strong statements, but with any data transformation you do, like you say, you lose the link, and then it's actually gone forever. So you really want to make sure that your data lake is at least your single source of truth.
Tobias Macey
0:36:33
And so going back to the SQL implementation, I'm wondering if you can talk a bit about how it's implemented in your overall architecture, and some of the ways that the implementation of your system has changed since the last time we spoke.
Yoni Eini
0:36:49
So the way we did SQL, in the end you can think of it as a bottom-up rather than a top-down approach. It's not like we said, well, okay, we want SQL, so let's just add SQL and see what works. It was more that we had in the back of our minds that SQL is important, but we didn't want to get to a point where we have a partial SQL language; we didn't want to get to a point where, okay, I support SQL, but no joins, or SQL but no group bys, things like that. So from our perspective, it was, okay, first of all, do we actually have full support for the data transformations that you'd expect in SQL? We had that. Then it was, okay, how about filtering? That's pretty easy, we had that. How about joins? That one's pretty hard. And the underlying capabilities were actually already there as far as joining. The way a join works in Upsolver is that you build an index of the right-hand side, and that index is sitting in S3. So it's still in the data lake, but it's conceptually a key-value store, equivalent to what an index would be in a database, and then when you do a lookup from it, when you do the join, that join is going to perform very fast. So that general capability was already there; at the time we called them indexes, now they're called lookup tables. But adding SQL really forced us to understand exactly what's necessary as far as the definition, and boil it down to the basics. I can say SELECT FROM stream LEFT JOIN this lookup table ON key, and that's it, and it'll work, and it'll do exactly what people expect it to do. So I think it forced us to be a lot more concise about how you define these things. And then group bys, also. Group by is actually a pretty interesting semantic on a data lake, because in a database, when you run a group by, it runs on the entire query. If I have a database with a billion records and I say group by key, get me the max value, I'm not going to get the max value of the key from the last minute or something like that; I'm going to get a full SQL statement that just returns, for that key, the max value over everything. I might have indices behind the scenes that help me resolve that, but in the end the statement runs on all the data. And in a data lake, there isn't really that concept. When I run SQL in an ETL, I'm always appending more data, so what am I grouping by? What time window is it? And then I'm going to have duplicate keys of the group by, which is kind of weird when I run my queries on top of the data lake. So I think that was for us the most challenging part: adding this kind of replacing functionality. We support both, because of course a lot of people are used to streaming data and they want the append functionality: I say group by country, tell me the number of users, and every minute I just want to get however many users there were in each country, append it, and do an additional group by in my database, my data lake layer. But the capability to say SELECT country, COUNT(DISTINCT users) FROM my stream GROUP BY country,
and what I actually want is that the result is going to be a table that has one row for each country, with the count of distinct users in that country, and every minute I want it to update so it has the new distinct counts, but that's all I want in the target: that was something that was conceptually super difficult to accomplish. And to get SQL working, to get SQL out, from my perspective we really needed to have that capability. So I think that was probably the biggest challenge from our perspective: implementing a real group by on top of a data lake.
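A rough sketch of the two statements Yoni describes, in generic SQL with hypothetical stream and table names; the exact Upsolver keywords for lookup tables and output modes are omitted, only the shape of the queries is shown.

    -- Illustrative only: enrich a stream by joining against a lookup table
    -- (an index materialized in S3, per the discussion above).
    SELECT e.user_id,
           e.event_time,
           u.country
    FROM clickstream e
    LEFT JOIN users_lookup u   -- "lookup table": a key-value index kept in the lake
      ON e.user_id = u.user_id;

    -- The "replacing" aggregation: the target should hold one up-to-date row per
    -- country, refreshed as new events arrive, rather than an ever-growing append log.
    SELECT country,
           COUNT(DISTINCT user_id) AS distinct_users
    FROM clickstream
    GROUP BY country;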
Tobias Macey
0:40:42
And so in terms of when you're onboarding customers, particularly now that you have this SQL layer for being able to empower the DBA to be the owner of the data lake, what are some of the main concepts that you generally need to educate your customers on?
Ori Rafael
0:41:00
The main two things are the relational model, which we touched on briefly before, about not creating a relational model in the end but actually creating the views they want to query. And the other thing is to think of your data as an event stream. We said that Upsolver's approach is Kappa, not Lambda, and we want to do everything in stream processing. But the way you need to think about it is that you're looking at the context of an event, and you want to enrich that event with historical data; you're not joining static data sets. So everything has a time-based filter, a time-based way of thinking about it. Everything is additive processing, not standard batch. A lot of people are used to batch, and sometimes it takes some time until they make the leap forward to stream processing. My experience has been that once users do it, they really can't go back, but it takes them time to do
Tobias Macey
0:42:00
it. And what are some of the features of your platform or capabilities or ways of using it that are either often overlooked or underutilized by your customers that you think they would benefit from using more frequently?
Ori Rafael
0:42:16
I think that since we met, and even since you interviewed Yoni, even before that, we were very focused on analytics use cases, where the person in the end was querying the data. But Upsolver is also a really good solution for doing machine learning, like streaming machine learning. And we
Tobias Macey
0:42:33
have a few customers that are using it and are very happy with it, but we haven't really spent a lot of time educating our users on how to build a data set from a stream in a way that lets you actually productionize the models you create afterwards. So I think that part is kind of overlooked, and it's something that we plan to change going forward. With the streaming capabilities and being able to run machine learning models on your platform, what have you found to be the adoption or viability of being able to do something like reinforcement learning, which requires that real-time feedback loop to be able to update the model and update its behavior in real time?
Yoni Eini
0:43:12
Yeah, I mean, exactly.
0:43:15
That exactly hits the nail on the head. That's the kind of thing that's super easy to do with Upsolver, kind of ridiculously easy; you don't even think about it, it just happens. Whereas people think of it as something they might put on their roadmap as an epic five-year project.
Ori Rafael
0:43:33
But let's be specific, Yoni. The very specific thing is the fact that you are doing everything with streaming, and the fact that you protect the user from taking in data from the future and creating leakage in their model. The fact that we are doing everything in stream also means that you're not calculating your data set in one way and then calculating the features for serving in a different way. You have just one way to create the features, so you don't need to go searching for bugs and asking what you did differently between your batch process and your stream process. That's usually the main issue we saw, unless you see something else?
Yoni Eini
0:44:14
You kind of have to, for real-time machine learning, traditionally have the Lambda architecture. It's really going to be: you have Spark for the offline, you have some code written in some language for the online, and these two code bases are never going to do exactly the same thing, and they often don't even have access to the same data at the same time. Because they're so different, and because machine learning models are very sensitive to small changes in how the data looks, these projects just don't work in the traditional sense. You kind of have to have the Kappa architecture; you have to have a single way of defining it and accessing the historical data, and it has to work the same on your offline data and on your online data. It has to work exactly the same, otherwise it's just not going to work. And that complexity just goes away, because that's just how Upsolver
Tobias Macey
0:45:06
does it. In terms of your experience of building the Upsolver platform and democratizing it for the DBA to be able to handle the data lake, and just growing the overall business and technical elements of the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process, both in bringing it to the DBA and overall in building Upsolver?
Ori Rafael
0:45:27
I think you're probably asking from both a business and a technological perspective, right? For me, when we first started we were always trying to build the best product, to create the best engineering solution to every problem. And then you find out that the best solution is not necessarily a familiar solution. It's hard to keep educating users all the time; it creates friction in both the sales process and the implementation process. So we changed our state of mind to look for the familiar. The addition of SQL to the platform came after we had made many mistakes in which we didn't create an experience that was familiar enough for the users. We did a few POCs where we felt we provided a good user experience, and then an analyst would come in and say, "Hey, I don't care about this UI that you have built. I know SQL, give me SQL, that's what I understand." Once we added that, we found customers doing a migration from a data warehouse to a data lake actually taking their SQL from the data warehouse, copy-pasting it into Upsolver, and starting to tweak it. That blew my mind. I never imagined that once you gave them the familiar option, they would start using the system in ways I didn't think about. You can take that to other features as well, like the fact that Upsolver is priced per compute. When we were just getting started we tried to price by volume of data, but data processing solutions are usually priced by compute. It's not that compute pricing is much easier to understand, it's that everyone already understands it. The same goes for deployment: the way we deploy today is we give you a CloudFormation script, because that's how people deploy software on AWS. So putting an emphasis on the familiar was maybe my biggest takeaway from the last couple of years.
Yoni Eini
0:47:24
Yeah, I think also, you have a ton of different features, and it's very hard to explain them concisely. There's so much complexity there, so much going on, so many different capabilities. And then when you tell someone, "well, it supports SQL," that's really packing a lot of complexity into a very short statement, because they already know what to expect.
Tobias Macey
0:47:47
And as you plan for the future of the business and look to the current trends in the industry for data lake technologies and usage of data lakes, what do you have planned for your roadmap?
Yoni Eini
0:47:59
I'd say definitely everything around portability. Today we're really AWS focused, but we don't want to be exclusively for AWS customers. We want anyone who has a data lake, or wants a data lake, to be able to do it, whether they're on premises, in Azure, or wherever it is. So a big thing is just making sure that everyone can have access. Data lakes also have a unique challenge, and I'm not sure how much it's affecting data lakes today, but I think it's definitely going to be more and more of an issue going forward, which is GDPR compliance. Data lakes are very good for storing large amounts of data, but they're very unwieldy in the sense that if you want to delete something, it can be almost impossible. Maybe today many organizations are just giving up and saying, "Well, okay, it's in the data lake, maybe that's okay, I hope nobody sees it," but that's not going to fly much longer. We actually have some pretty interesting solutions to these problems as far as data lake management goes, and since Upsolver is a data lake management platform, it's kind of trivial that we would be the ones to enable these features. I think that's going to be super important going forward. They're features that exist today, but they're not wrapped in a way where, okay, you have a GDPR compliance button, you can just click it and delete a user's PII. So that's something we're definitely looking at focusing on. And then maybe the last thing I would say is focusing on educating users: everything from documentation, to tutorials, to webinars, to workshops and training sessions, everything you can think of. Part of it is training on Upsolver, but part of it is a lot of the stuff we talked about today, like how to think about streaming data, or how to think about your data as streaming data. So another big focus of ours is just to make sure that people can get access to it. Part of that is making sure the platform is as easy and familiar as possible; part of it is making sure there's a lot of information around it so they can figure out what's going on.
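To make the GDPR point concrete: deleting one user from a data lake typically means rewriting every file that contains their records. A minimal, hypothetical sketch (not Upsolver's implementation; bucket, prefix, and column names are assumptions) of doing that for Parquet objects on S3 might look like this:

```python
# Minimal sketch of a "delete this user" pass over a data lake: every Parquet
# object under a prefix is downloaded, filtered, and rewritten without the
# offending rows. A real pipeline would also handle partitions, manifests,
# compaction, and retries.
import io

import boto3
import pyarrow.compute as pc
import pyarrow.parquet as pq


def delete_user(bucket: str, prefix: str, user_id: str) -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            table = pq.read_table(io.BytesIO(body))
            filtered = table.filter(pc.not_equal(table["user_id"], user_id))
            if filtered.num_rows == table.num_rows:
                continue  # this file never mentioned the user
            buf = io.BytesIO()
            pq.write_table(filtered, buf)
            s3.put_object(Bucket=bucket, Key=obj["Key"], Body=buf.getvalue())
```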
Tobias Macey
0:50:15
All right, are there any other aspects of the work that you've done on Upsolver, or of using SQL as the interface for data lakes, or just overall data lake technologies and usage, that we didn't discuss that you'd like to cover before we close out the show?
Yoni Eini
0:50:26
I think we touched on most of the points. I'd just say that I'm super excited about what's going on now. A year or two ago, data lakes were on everyone's lips, and today data lakes are in everyone's AWS account. That's already a pretty huge thing, but I don't think people understand the magnitude of the shift that's going to happen. Today, databases probably still account for more data, or maybe not, maybe at this point data lakes hold more data than databases. But as data lakes grow exponentially and databases grow linearly, really all of the world's data is going to be in the data lake, all of it; anything else is going to be a rounding error. So having all of the different capabilities that you expect on top of the data lake has to happen, because the storage is going to be in the data lake; it's just so much more competitive from a cost perspective. I think this is a really exciting time to be talking about this stuff.
Tobias Macey
0:51:30
All right. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Ori Rafael
0:51:46
I think you're kind of playing both extremes. If you're talking about data management, you either have the database option, which gives you the ease-of-use advantages but has all the pitfalls when it comes to scale and the ability to use additional systems, or you have the data lake, which is still very, very complex. Spark replaced Hadoop, but it's still as complex as Hadoop. That's why Upsolver is very focused on that complexity as the main problem we want to solve. If I would add to that, I would add metadata management. I think the fact that each database manages its own metadata is something that's going to change, since we have so many different databases and query engines, and concepts like the AWS Glue Data Catalog and the Hive metastore. So centralized metadata management, where you're also doing centralized security management, is something that's going to take a much bigger place going forward.
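As a small illustration of the centralized metadata idea, here is a hedged sketch of registering a data lake table in the AWS Glue Data Catalog with boto3, so multiple query engines can share one set of table definitions. Database, table, bucket, and column names are hypothetical placeholders:

```python
# Minimal sketch: register a Parquet table in the Glue Data Catalog so that
# Athena, Spark, Presto, and friends all read the same schema and location.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="analytics_lake",  # hypothetical catalog database
    TableInput={
        "Name": "page_views",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id", "Type": "string"},
                {"Name": "url", "Type": "string"},
                {"Name": "event_time", "Type": "timestamp"},
            ],
            "Location": "s3://example-data-lake/page_views/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```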
Yoni Eini
0:52:48
Yeah, and I'd say from my perspective that today data lakes are really only addressing a small portion of the actual business use cases. Often they're going to be an interim step, or maybe an end target for long-term storage or something, but today the users who are getting a lot of value out of the data lake tend to be data analysts. Of course that's a huge market, and it's our bread and butter today, but going forward there will be more use cases. That's something I don't see today: you don't see many people saying data lake for OLTP, or data lake for stream processing, for that matter. I think these things should happen. For stream processing, in a way, Kafka is saying that now with their S3 extension. They're kind of saying, well, okay, why do you need a giant Kafka cluster when three nodes can deal with two million events per second? The reason it's big is all the historical data. So keep a tiny bit of the data in the live Kafka cluster and put everything else on S3, and if that's seamless, wow, suddenly it's so much cheaper and better. So I think that adding more use cases on top of the data lake is something that is really nascent, really just starting, and there are a lot of new, interesting things that can happen there.
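To sketch the "small hot Kafka plus cheap S3 history" pattern Yoni mentions (a conceptual illustration only, not Confluent's or Upsolver's implementation; topic, bucket, and prefix names are hypothetical), a consumer can stitch archived events on S3 together with the live tail still retained in Kafka:

```python
# Conceptual sketch: read historical events from Parquet files offloaded to
# S3, then continue with recent events from a small Kafka cluster, behind one
# iterator interface.
import io
import json

import boto3
import pyarrow.parquet as pq
from kafka import KafkaConsumer  # kafka-python


def read_historical_events(bucket: str, prefix: str):
    """Yield archived events from Parquet files on S3."""
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for row in pq.read_table(io.BytesIO(body)).to_pylist():
            yield row


def read_live_events(topic: str, bootstrap: str):
    """Yield recent events still retained in the (small) Kafka cluster."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        yield message.value


def read_all_events():
    # History first (cheap object storage), then the live tail.
    yield from read_historical_events("example-events-archive", "page_views/")
    yield from read_live_events("page_views", "localhost:9092")
```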
Tobias Macey
0:54:14
One thing as well that I see as being a big gap in the space is around being able to test and validate your ETL logic and run it through a CI/CD pipeline, to ensure that you're not injecting errors into your overall transformations, and being able to do a before-and-after comparison to confirm that the work you're doing is what you anticipated and what you actually wanted, given the real-world data.
Yoni Eini
0:54:40
Yeah, I totally agree. I think that's probably one of the biggest focuses in our system: previewing the data as much as possible and allowing you to iterate very quickly. I 100% agree with you there, especially because actually running an ETL process is very expensive; there's a huge cost associated with getting it wrong the first time.
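To make the testing idea in this exchange concrete, here is a minimal pytest-style sketch (transformation, fixture paths, and column names are all hypothetical) of validating ETL logic in CI by comparing the output on a sample dataset against a checked-in "golden" result:

```python
# Minimal sketch: run the transformation on a small, version-controlled
# sample of real-world-shaped data and diff the result against an expected
# golden output, so unintended logic changes fail the CI build.
import pandas as pd


def transform(events: pd.DataFrame) -> pd.DataFrame:
    """Example transformation under test: events per user per day."""
    events["event_date"] = pd.to_datetime(events["event_time"]).dt.date
    return (
        events.groupby(["user_id", "event_date"])
        .size()
        .reset_index(name="event_count")
    )


def test_transform_matches_golden_output():
    sample = pd.read_csv("tests/fixtures/sample_events.csv")
    expected = pd.read_csv("tests/fixtures/expected_daily_counts.csv")
    expected["event_date"] = pd.to_datetime(expected["event_date"]).dt.date

    result = transform(sample)

    # Before/after comparison: any change to the logic shows up as a diff
    # against the golden file checked into the repository.
    pd.testing.assert_frame_equal(
        result.sort_values(["user_id", "event_date"]).reset_index(drop=True),
        expected.sort_values(["user_id", "event_date"]).reset_index(drop=True),
    )
```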
Tobias Macey
0:55:04
Yeah, that's probably an entire episode on its own. So with that, I'd like to thank you both for taking the time today to join me and discuss the work that you've been doing with Upsolver to make the data lake a manageable and enjoyable experience. I appreciate the work that you're doing on that front, and I hope you enjoy the rest of your day.
Ori Rafael
0:55:22
Thank you very much, you too, and thanks for having us.
Tobias Macey
0:55:30
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!