Building A Data Lake For The Database Administrator At Upsolver - Episode 135

Summary

Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Eini describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.

Machine learning is finding its way into every aspect of software engineering, making understanding it critical to future success. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype.

Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
  • Your host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Eini about building a data lake for the DBA at Upsolver

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of what a data lake is and what it is comprised of?
  • We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?
    • How has Upsolver changed or evolved since we last spoke?
      • How has the evolution of the underlying technologies impacted your implementation and overall product strategy?
  • What are some of the common challenges that accompany a data lake implementation?
  • How do those challenges influence the adoption or viability of a data lake?
  • How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?
    • What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?
  • What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
  • How is the SQL layer in Upsolver implemented?
    • What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?
  • What are the main concepts that you need to educate your customers on?
  • What are some of the pitfalls that users should be aware of?
  • What features of your platform are often overlooked or underutilized which you think should be more widely adopted?
  • What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?
  • What do you have planned for the future?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What advice do you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 Gbit private networking, scalable shared block storage, a 40 Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you get everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a machine learning expert who provides unlimited one-to-one mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there's no obligation. Go to dataengineeringpodcast.com/springboard today and apply. Make sure to use the code AISPRINGBOARD when you enroll. Your host is Tobias Macey, and today I'm interviewing Ori Rafael and Yoni Eini about building a data lake for the DBA at Upsolver. So Ori, can you start by introducing yourself?
Ori Rafael
0:02:29
Sure. So I'm the CEO, one of the co-founders of Upsolver, and I'm coming from a DBA and big data integration background.
Tobias Macey
0:02:38
And Yoni, how about you?
Yoni Eini
0:02:39
Hey, I'm Yoni. I'm the CTO and the other co-founder of Upsolver. Most of my experience before Upsolver is around data science, big data preparation, streaming data, all sorts of stuff like that. And then, of course, at Upsolver we're building a high-scale data lake platform.
Tobias Macey
0:02:57
And Ori, how did you first get involved in the area of data management?
Ori Rafael
0:03:00
I was working on a data lake with Yoni. We were trying to solve an advertising optimization problem over the data lake, and we found ourselves spending a lot of time, a lot of data engineering time, just to be able to query and work with the data. I was thinking, I think I was a good DBA, but I didn't really have the ability to go and work with the data lake directly. So that's when we started working with data management, and that's when we
Tobias Macey
0:03:28
kind of talked about the idea. And Yoni, how about you? How did you first get involved in data management?
Yoni Eini
0:03:33
So I think for me, it's
0:03:34
always been something that has been top of mind. My first job was in the IDF, and I was a DBA. Then after that I was a data scientist, a researcher, an application developer, and finally CTO there. And throughout everything, everything was data, everything was large volume. Well, small volumes from today's perspective, but back then it was large volumes, with streaming data and a lot of tricky decisions on how to manage it and where to put it. So it's always been something that was super interesting for me. And then I think that at Upsolver we very easily gravitated towards taking on this hard problem of, how should this data actually be managed? What are the best practices? Do people really need to worry about it at all, or should it just be done? Is there a right way to do it?
Tobias Macey
0:04:27
So Yoni, you were actually on the show almost two years ago now, in November of 2018, when we talked a bit about what you're building at Upsolver: this platform for managing data lakes in the cloud. Before we get too far along in the conversation, can you each give your definition of what a data lake is and what it's comprised of?
Ori Rafael
0:04:48
Sure. So I think the easiest way to think of a data lake is that it's a decoupled database. I'm taking the storage part, the metadata part, and the compute part, and starting to manage each one separately, which gives me a lot of advantages when it comes to elasticity, cost management, and scaling. On the other side, it creates a lot of complexity. That would be my definition.
Yoni Eini
0:05:12
Yeah, I think for me, in the end, the data lake is a cost thing. As data volumes grow, databases are great, they're really good at what they do, but it just becomes too expensive. And the cost also determines how much data you can deal with. If I have a database that costs me $1,000 per terabyte, then at 50 terabytes I'm thinking to myself, well, maybe I don't want to store this data anymore. And then along comes the data lake, and it really opens up your capability to deal with a lot more data, just because it's so cheap. So really, in the end, the bottom line makes the whole difference. I'd say that for me a data lake is the natural progression towards larger and larger datasets, and I think it's very important to think about it that way: the data lake shouldn't be a compromise. Maybe today, for a lot of people, it is a compromise, and they would rather have the data in a database, but it should be as powerful and as useful as a database. Since I see that as the case, from my perspective it's just the natural progression that things go there.
Tobias Macey
0:06:20
Some of the initial ways that the data lake started to manifest were with the Hadoop ecosystem, the MapReduce project, and the HDFS file system for being able to spread your data across multiple machines. Nowadays, a lot of people are using object storage for the actual storage mechanism. And even in the past two years, there's been a lot of movement in the availability of different technologies for working with data lakes, and different managed platforms such as yours for being able to build out a data lake with a single layer. I'm wondering how the overall landscape of data lake technologies and the overall adoption of data lakes as a solution for businesses has changed since the last time that we spoke.
Yoni Eini
0:07:04
I can jump in and say that, from my perspective, I think the tools have become a lot more powerful. Two years ago there were a few compromises that you really had to make, because in the end, if you're talking about a data lake, it's not an SSD: you have access latency, and of course there are all the eventual consistency issues of your storage and things like that. And I think that over the past two years these issues have pretty much been resolved. Today, from object storage you get performance that's equivalent to SSDs. The cost hasn't gone down much in that time, but just the fact that the performance is so much better today means that, I think, pretty much any use case you can think of, you could solve using a data lake given the correct data modeling. Maybe the other
Ori Rafael
0:07:50
thing, Tobias, is the popularity. Two years ago I was still explaining data lakes in many cases, and today everyone has a data lake agenda. If you look at the top companies in the market today, look at the Spark adoption: everyone is doing Spark, much more compared to two years ago. And if you look at the big data warehouses, they're starting to call themselves, you know, we are a data warehouse, we are a data lake, we are a data platform, and they're trying to add capabilities to query both their traditional model and the data lake directly. You see it across all the big vendors. You have Redshift with Redshift Spectrum giving you the data lake capabilities, you have the launch of Azure Synapse Analytics with serverless queries over blob storage, and BigQuery has BigQuery external tables in beta. Snowflake is calling itself the data platform today, and not the cloud data warehouse like it did in the past. So I think that's a big change, where everyone wants to succeed with the data lake use cases with different types of solutions.
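As a concrete illustration of the pattern Ori describes, where a warehouse queries the lake in place, here is a minimal sketch of a Redshift Spectrum external table over Parquet files in S3. The schema, IAM role, bucket path, and table names are hypothetical.

    -- Hypothetical example: expose Parquet files in S3 to Redshift via Spectrum.
    -- The external schema points at a Glue Data Catalog database; all names are illustrative.
    CREATE EXTERNAL SCHEMA spectrum_demo
    FROM DATA CATALOG
    DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    CREATE EXTERNAL TABLE spectrum_demo.page_views (
        user_id  VARCHAR(64),
        url      VARCHAR(2048),
        event_ts TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://my-lake-bucket/page_views/';

    -- Queried like any other table, but the data never leaves the lake.
    SELECT COUNT(*) FROM spectrum_demo.page_views WHERE event_ts > '2020-01-01';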
Tobias Macey
0:08:56
And how do the cloud data warehouses such as Snowflake or BigQuery differ from a full-fledged data lake in terms of what they're able to offer, and maybe some of the cost issues or performance capabilities, compared to using the native data lake technologies, whether that's things like Presto and Spark or a managed platform?
Yoni Eini
0:09:20
I'd say that there are two major advantages that you have with a data lake. One is the cost, and I think that both BigQuery and Snowflake are addressing the cost. Of course there's a cost for the platform itself, but beyond that, the cost per storage is going to be similar to a data lake. And by the way, you see that across a lot of vendors. Kafka now has, or I'm not sure if it's out yet or they're releasing it as we speak, the S3 extension to the Kafka stream, or you have Redshift using Redshift Spectrum. They're all adding these extensions to the data lake and enjoying the cost. But specifically when you talk about Snowflake and BigQuery, the data is sitting inside these systems: once it goes in, it doesn't come out. And I think that really breaks the second big advantage of the data lake, which is that you can put it before other systems. It goes in the middle of your architecture: all your data comes in, it goes into the data lake, and then you can do whatever you want with it and send it to whoever wants to consume it. With BigQuery and Snowflake, you pretty much need to consume it within BigQuery and Snowflake, and I think that really limits the flexibility of what you can do. And there are a lot of use cases where, say, you want to use a key-value store, and there's no way to pull the data out of Snowflake and put it into a key-value store in a way that's really going to be effective.
Ori Rafael
0:10:46
Yeah, exactly. Speaking as a DBA, in the old days as a DBA we solved everything with Oracle, because our data was in Oracle. Today it's called BigQuery or Snowflake, but if the data is there in a proprietary format, I'm going to go to great lengths to solve the problem there and not replicate the data to an additional database. A data lake gives me that ubiquitous access option that I don't have with databases that come with a proprietary format.
Tobias Macey
0:11:15
And so in that same time span, roughly the past two years, in addition to the changes in data lake technologies and their overall adoption, what are the ways that the Upsolver platform has changed or evolved since we spoke, and how has the evolution of those underlying technologies impacted your strategy for implementation and the features that you decided to include?
Yoni Eini
0:11:39
So I'd say that, I mean, last time we spoke, in 2018, that was our first year of general availability. So of course the platform has changed a lot as far as just maturity and scale, how much data we can deal with and how much data we are dealing with. Today we have customers that are putting two gigabytes per second through the system, which is a far cry from where we were then, and I'm sure it can scale much beyond that. But I think the main difference between the platform then and the platform today is that today we have SQL as a definition language on top of the UI-based architecture. It used to be just UI: you could define everything using the user interface, and you had a very broad set of capabilities. But today you have full SQL on top of your data, and I really think that's a game changer, because in the end, the UI is very nice, and it democratizes in a way that even SQL can't necessarily, because not everyone knows SQL, but in the end SQL is the language of data. No matter what ETL process I'm doing, in the end I'm most likely going to be querying the results using SQL. So I think it's a huge difference to have one language for the entire pipeline. And then the second question, about the underlying technologies: S3, and I'm not sure how many people are aware of this, because S3 was always really powerful and I'm not sure how many people are actually pushing its limits, but a few years ago they had a best practice that you should use prefixes in your buckets that are hashes, which completely didn't fit with anything else that S3 does; it was a performance optimization they were telling you to do. And about a year and a half ago they released a new article saying, well, actually, you don't need to do that anymore. And by the way, remember that we told you it's good for 100 requests per second? Now it's good for 5,500 requests per second, and you can scale that out between different prefixes in the bucket, and between buckets, so you can multiply that by 10 if you want. So I think the difference between 100 requests per second and 5,500 or 55,000 requests per second to your S3, your blob storage layer, basically means that you can do anything. You really don't need anything aside from this simple, cheap storage, which wasn't necessarily the case two years ago. I think that's a very big difference.
Tobias Macey
0:14:04
And so in terms of the adoption and implementation of data lakes, what are some of the common challenges that accompany that, whether people are using a managed platform or a self-built system using a lot of open source technologies? What are some of the difficulties or complexities that arise regardless of the actual technologies that you're building on top of?
Yoni Eini
0:14:30
I think, if I'm paraphrasing: given the fact that I'm using a data lake, what gotchas are there? How is that making my life miserable? So I think that in general, in a data lake, you really need to worry, or at least traditionally you'd really need to worry, about the low-level stuff. It's not like a database; sure, you have a DBA, and you have to build your indices and things like that, but in the end you're not worrying about how the database is storing the files, or whether there's replication going on, or what's happening behind the scenes to make sure that it's load balancing. All of that is handled for you by the database itself. With a data lake, it's even worse than that: you don't only need to worry about the load balancing and where and how you're going to store it. I was just saying how powerful it is, but it's actually very weak in the sense that you don't have a lot of capabilities for discovering data; you pretty much need to know where it is. So you have this file system, you have to figure out your folder structure, you have to figure out your file formats, you have to figure out your compression, all this stuff that a database would normally do for you, you need to do on your own. And all the triggering, so process management and orchestration, and if you want state, where to keep the state. In the end, it's a mess. So it comes out that something like 90% of the time you're working on making sure that your data lake is performing as a database should, or performing as the storage system of a good system would, just putting it in place and preparing it and massaging it to make sure it's good, and then 5% of your time is actually doing stuff with the data, actually consuming it. And often it's going to be different people doing the two, so the people who are consuming the data lake are just going to be waiting for people to implement stuff for them. On-prem it's even worse, because then you even have to manage the storage itself; HDFS isn't the easiest thing in the world to manage. But even in the cloud, where you have S3 and that's super great, you still need to do a lot of work to make sure that everything works as it should. It doesn't really add up in the end. You hear a lot about data swamps: you wrote data into your lake, and then it became a swamp because you can't access it. You lost the data, basically. And it's even worse, because you're paying for it and you can't even really delete it, because you don't know where it is. So I think that would be the common challenge: you actually have to build the data lake, rather than just consuming a prebuilt product which is a data lake.
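To make the point about hand-rolled folder structures, file formats, and partitions concrete, here is a minimal sketch of the kind of layout and table definition a team typically maintains themselves on a data lake. It uses generic Hive/Athena-style syntax rather than anything Upsolver-specific, and the bucket, paths, and columns are hypothetical.

    -- Hypothetical hand-managed S3 layout, e.g.:
    --   s3://my-lake-bucket/events/event_date=2020-06-01/part-00000.snappy.parquet
    CREATE EXTERNAL TABLE events (
        user_id    STRING,
        event_type STRING,
        payload    STRING
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-lake-bucket/events/';

    -- New partitions must be registered (here, or via a crawler) before they are queryable.
    MSCK REPAIR TABLE events;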
Tobias Macey
0:17:02
So Upsolver, I know, aims to solve a lot of those challenges by ensuring that there is some appropriate modeling going on and that the data, as it's being written into the lake, is optimized for access and scale. And you mentioned too that with a database you generally don't need to worry too much about what's going on under the covers, because it handles a lot of it for you. But for people who are using a data lake, particularly if they're using something like Upsolver, what are some of those underlying realities of the data systems that power the lake and power your platform that still need to be understood by the operators to ensure that they don't accidentally shoot themselves in the foot?
Ori Rafael
0:17:40
In Upsolver, users pretty much only think about how they're going to transform the data. They're pretty much mapping their whole data into tables, so we're not giving them any control over orchestration or exactly how the data is stored. We give some configuration control, but the idea is that our approach is: don't let the users trip. So you're kind of asking where they could still trip, and I think it's more the concept of how you're going to eventually organize the data. As a DBA, I'm always used to creating relational models: I create the relational model, and then the users answer their questions with views on top of that model. But if you go to a data lake, you can't create a relational model, because the data lake is not indexed and all of those joins will not work well. So you should pretty much take all the data and map it directly into tables that are not necessarily relational. Let me try to illustrate that with an example. Let's say I'm working on an advertising problem, and I have a table of impressions and a table of clicks. I could just go and create a table for each one and then try to join them, and it will just not work. The way I would approach it in a data lake is that I would create a table that includes both the impressions and the clicks and let the user query that. Although it costs more storage, storage is cheap, and it will be much more beneficial from a query perspective.
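A rough sketch of the denormalized approach Ori describes, in generic SQL with hypothetical column names rather than Upsolver's own syntax: instead of two normalized tables joined at query time, impressions and clicks land in one wide table keyed by impression.

    -- Hypothetical denormalized event table: one row per impression,
    -- with click information folded in rather than joined at query time.
    CREATE TABLE ad_events (
        impression_id   VARCHAR(64),
        campaign_id     VARCHAR(64),
        impression_time TIMESTAMP,
        clicked         BOOLEAN,
        click_time      TIMESTAMP   -- NULL when the impression was never clicked
    );

    -- Click-through rate per campaign, with no join required.
    SELECT campaign_id,
           AVG(CASE WHEN clicked THEN 1.0 ELSE 0.0 END) AS ctr
    FROM ad_events
    GROUP BY campaign_id;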
Yoni Eini
0:19:06
Yeah, when I think about databases, in the end there are two things in a database that I need to worry about. One is the data model: making it consistent with itself, making it correct, making it useful. And the second is that I don't run out of space. That's pretty much the two things I care about: I can't run out of space, and I have to make sure that the data is well modeled. In a data lake you don't have that space constraint, you don't need to worry about running out of space, so you can solve problems in a much easier way, I would say. And the denormalization is one of those things, I'd say a main one; in the data lake you don't need to worry about that. It's actually simpler; it's closer to how people actually think about stuff than a traditional database. And with this SQL layer that you introduced, you mentioned that there was always a visual workflow for being able to work with the information in the lake, but how does the introduction of that SQL capability change the accessibility and the overall staffing options for people to manage the data lake?
Ori Rafael
0:20:09
Eventually, 20 times more users. If you try to do the research, we went to LinkedIn looking at how many people know SQL versus how many people know data engineering or Hadoop and those kinds of technologies, and you see that the number of data engineers in the world is something like 2% to 5% of all data practitioners. Also take a look at the growth in the number of data lakes: if I look at on-prem Hadoop and combine all the customers from Cloudera and MapR and Hortonworks together, you get to less than 10,000 paying customers for a Hadoop distribution, while Amazon S3 has over a million customers. So you have exponential growth in data lakes, but you definitely don't have exponential growth in the number of data engineers. So I think maybe the biggest, most important thing in Upsolver is the ability to reach 20 times more users who get direct access to the lake, and don't need to drive the entire process through data engineering.
Yoni Eini
0:21:12
And I think when you're talking about SQL, it's very easy, and I don't want to badmouth anyone, but a lot of systems would say, okay, we support SQL, and in reality they kind of support, okay, take your SELECT, FROM, WHERE, and be thankful for the five built-in functions that we give you. And okay, sure, that kind of is supporting SQL; it means that you're still defining your transformations in SQL, but it's not really that useful. In the end, it's very much limiting your capabilities in order to fit into the language, or rather, maybe the capabilities were limited to begin with and the language is just reflecting that. I think there's a big challenge in making sure that you're actually giving fully fledged SQL, actually having all the capabilities that someone would expect from SQL, which is not trivial when you're talking about streaming data especially, and also on top of the data lake, and then also making sure that people understand that, given that they're defining their ETLs in SQL, they're not losing any capabilities. It's not that if I were to write things in code I would be able to do other stuff, or better stuff, or do what I want, but I can't do it in SQL. You kind of need to re-educate people. In databases they were already convinced; people switched there because they understood that they could do all their data work in SQL. Now you need to convince them that they don't need to write code in Spark, they can do it in SQL, and it will actually do everything they want.

And so in terms of the actual ways that you have implemented SQL, I know that you're using the ANSI standard, but what are some of the useful extensions that you have incorporated that are relevant to the data lake context and simplify the work of your users?

So I'd say there are three very powerful extensions. The first one I kind of wish had been part of ANSI to begin with. SQL is super nice for data modeling and everything, but it's very hard to build functions on top of one another. So for example, I want to concatenate two strings; that's fine. Then let's say I want to concatenate two strings and do something else on top of that. You can also do that, but then you can't reuse it anywhere else in your statement. So often you're going to have a function in your SELECT statement, and then the exact same function in your GROUP BY or in your WHERE clause, just because there isn't any composability. So one extension that we added, you can think of it as a procedural SET statement: SET field name equals whatever transformation. That single small addition to SQL allows you to define complex transformations and then consume them downstream, in additional parts of your SQL statement. SQL's answer to that is using subqueries: you define a bunch of transformations in your innermost query and then use another subquery to define additional transformations on top of that, and it becomes horrendous; it's very hard to read that kind of statement. So I think using the SET statements really simplifies things, especially in SQL for ETL specifically, where you'd have 10 or 15 different data transformations; it would really become a nightmare without that.

So I think that's one very big difference as far as the capabilities of the language. The second thing is that SQL was traditionally made for relational databases, which are flat: your tables are flat. Eventually they added JSON support and nested column support and things like that, but it's very clunky and nobody knows about it; nobody really knows how to interact with nested data using SQL. So we added a few language extensions around dealing with nested data within the original structure. Conceptually, and maybe it's a bit hard without drawing it, let's say you have a purchase, which is a JSON, and that's how raw data comes in; 90% of the raw data we see in the world is JSON. So you have your purchase, which is the root of the object, and then you have an array of items, and each item might have, say, a quantity and a price per item. Now I want to reason about these things, and usually all your transformations are happening either at the item level or at the purchase level. So I might want to multiply the quantity of the item by the price of the item, and obviously I want to multiply within the same item; I want to scope that calculation to each individual item. I don't want to, as would by the way happen in naive SQL, take all of the prices, do a Cartesian product with all of the quantities, multiply everything together, and get an array of n-squared size. That doesn't make any sense. So we have a language extension, which is very subtle, that simply allows you to access fields within arrays directly. When you do that in Upsolver, it makes sure that all the transformations you do are scoped correctly: if I'm multiplying two values and they're scoped together in a field, it will multiply within that field and not outside of it. That really makes a very big difference, because again, among people working with SQL there are the SQL superstars and there are the people for whom SQL is a second language, and you don't want to force the people with SQL as a second language to just never deal with their nested data. And I think the UI also helps there, because it exposes the syntax in a friendly way: when you add it from the UI, you can see what the SQL generated by it looks like. And then the third thing, and this is getting to be a pretty long answer, but the third thing I would say is that SQL generally deals with static data. You have a table, and the table is finished. That's not entirely true, since in SQL databases new data is coming in all the time, but in essence, when you run a query, there isn't really a built-in temporal aspect. Whereas when you're dealing with data lakes and streaming data, there has to be a built-in streaming aspect, because I have to deal with everything incrementally.
So when an event arrives, when I'm joining, I need to reason about when that event arrived and what portion of the lake I'm joining into. So we added a few language extensions around being able to bring in, for example, the latest data, or waiting for a few minutes for the data from the other stream to catch up, things like that. There are a few additional keywords that we added which allow you to seamlessly deal with streaming data without needing to build huge sub-selects that implement those constructs.
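A loose sketch of the style of pipeline Yoni describes. The syntax below is illustrative only, not verbatim Upsolver SQL; the SET keyword usage, the array notation, and all field names are paraphrased from the conversation rather than taken from the product's documentation.

    -- Illustrative only: a SET-style named transformation that can be reused
    -- later in the statement, plus field access inside a nested array that is
    -- scoped per array element instead of producing a Cartesian product.
    SET item_total = data.items[].quantity * data.items[].price;

    SELECT data.purchase_id  AS purchase_id,
           data.items[].name AS item_name,
           item_total        AS item_total   -- reuses the SET expression, no repeated subquery
    FROM purchases_stream
    WHERE data.country = 'US';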
Tobias Macey
0:27:56
And yeah, there's definitely been a trend that I've been seeing in the overall data space of this move towards streaming SQL. A lot of that is implemented on top of things like Kafka Streams with ksqlDB, or there's the Eventador platform for handling streaming data using SQL, and there's also the Materialize platform, and a lot of open source implementations. So it seems like there's this sort of de facto specification of how streaming SQL works, at least conceptually; I'm sure that there are variants in terms of the specifics of the syntax. And I'm wondering what your thoughts are on the overall space of streaming SQL, what you see as the future of it in terms of incorporating it into the broader specification of SQL, and its necessity given the current ways that data is being used.
Yoni Eini
0:28:48
Yeah, so I think, first of all, you have to keep in mind that the SQL specification is, how do I put this nicely, maybe a bit of a legacy thing. I know it's maintained, you have SQL:2016, and you have a lot of versions of SQL, and they come out with a lot of new stuff, but really, people are stuck in 1999. Okay, common table expressions kind of became standard, people know about them more or less, but a lot of the new features that you have in the SQL standard are actually not very standard; none of the databases actually support them. And even among the old features, you have a lot of stuff in there that is completely irrelevant now, like all sorts of Fortran support as part of the standard. It's not that relevant. So I'm not sure how important it is, because anyway, up until 10 years ago, when you weren't talking about streaming SQL, just SQL, there still wasn't actually a standard in practice; the standardization was only skin deep. I think that today it's kind of the same: everyone has their own flavor, they all have their own extensions and their own additional keywords and ways of reasoning about things. I'm not sure that's the end of the world, as long as people stick to the basics and make sure that at least, when someone writes a SQL statement using the common base of knowledge of people writing SQL, it's going to do what they expect and not something weird. And then, if you add a keyword here and there, I think it's less important that it actually be standardized, just because it never was. Even if you add it to the standard, let's say if ksqlDB does really well, or if Upsolver does really well, and all these changes I just mentioned are added into the SQL standard, is ksqlDB going to change? Probably not; I don't think they care that much. So I'd say the standardization is less important, but keeping things as concise as possible and as close to the base as possible is maybe the most important thing. And then, when talking about streaming SQL,
Tobias Macey
0:30:50
so I mean, definitely this is a huge thing. It's a huge departure, because it's not traditionally part of the language, and it's a very big conceptual shift for data consumers. I think maybe that's one of the challenges that we see: people need to start reasoning about streaming data where they're used to thinking about static datasets. That's a big difference between Upsolver and Spark, for example. Spark talks static, and then you can do streaming on top if you want, but the language is static, whereas in Upsolver the language is streaming. But then you need to educate around that; people need to realize that today data is streaming, that's what it is. There's also the difference in perspective between the initial, broadly accepted view of the Lambda architecture, where you have the separation of batch versus streaming, and you process streaming in real time as it comes in but then periodically go and backfill using your batch data to make sure that you have a high level of accuracy, and the proposed Kappa architecture, which is more focused around streaming and being able to maintain aggregates based on that streaming data. And I'm wondering what your thoughts are on the overall ideas of Lambda versus Kappa, or some of the different ways that you can architect to account for streaming data while still being able to get more detailed analysis based on the data that has actually landed in your lake, rather than trying to maintain these windowing functions where you have an imperfect view of the entire history of your information.
Yoni Eini
0:32:26
Yeah, so I'd say, first of all, the problem with the Lambda architecture is that you need to write stuff twice, and you really don't want to be writing stuff twice. Lambda came out of the world of batch, when they wanted to add a streaming layer. So you have, okay, I already have my batch, now add an additional streaming layer and figure that out in a separate language or something, and so I have my Lambda architecture. And then I think Kappa, maybe it's a bit naive, but conceptually it's saying, well, let's discard the past and say all the data is streaming anyway, kind of like what I've been saying now. So actually, why do you even need the batch layer? Just use the streaming layer, it'll do the batch, and everything will be fine. Of course you have performance issues, and you have to make sure you're not losing capabilities. But definitely, you have to have one language; you can't have two definitions of the same ETL. It has to be defined once, and then your infrastructure needs to convert it to either the streaming layer, if you have such a separation, or the batch layer. From Upsolver's perspective, we're a data lake platform, so of course we can deal with huge amounts of historical data and batch data and all that, but we treat it as streaming data. So in a sense I'd say that's similar to the Kappa architecture, in the sense that we only have one way of dealing with data, which is streaming, but we do it in such a way that you can deal with a huge, effectively infinite amount of data. So it's slightly different from, let's say, Flink, and the main proponent of Kappa I would say is Flink, where the way you do it is, you have a stream of data and you just run that stream from the beginning fast, and maybe separate the stream into a lot of different streams. But that requires a lot of preparation in advance. In Upsolver, we build the data lake in such a way that you can run your Kappa on top of the entire data lake very quickly, by splitting it up into time chunks and things like that, but you're still streaming over everything.
Ori Rafael
0:34:21
And Tobias, one thing that you mentioned is the sliding time window and the limitation there. I think that's one big limitation that Upsolver addressed. The way streaming systems are built, you can only address and work with data that you can currently fit in memory in that time window. So we've built an indexing system that goes along with your stream, so when you want to combine your real-time data, you can combine it with all your historical data and not just the data that you can fit in the window. Our objective here is to do everything with a stream and implement Kappa, and not do Lambda.
Tobias Macey
0:34:59
And one other thing that I want to call out from your earlier answer, Yoni, is the handling of nested data. As you said, that is one of the consistent challenges, particularly in data lake systems: you don't want to have to pre-process the information a lot before it comes into your lake, and so you do often end up with these nested JSON structures or other formats that have nested fields. Being able to access those in an intelligent and natural way is something that's a shortcoming of a lot of the platforms that I've tried to use at various times. In the data warehouse approach, you would generally handle that flattening of nested structures as part of your transformation before loading it into the database. So recognizing that, being able to handle those fields is definitely a benefit to people who are trying to work with their data without having to do a lot of upfront work ahead of time, and without potentially losing information or context by flattening things without introducing the appropriate information as to where those flattened fields originally existed.
Yoni Eini
0:36:07
I mean, definitely the data lake needs to represent the original data, the data that was generated. If you're doing data transformations before dumping it into the data lake, you're doing it wrong; I probably shouldn't make such strong statements, but with any data transformation you do, like you say, you lose the link, and then it's actually gone forever. So you really want to make sure that your data lake is at least your single source of truth.
Tobias Macey
0:36:33
And so going back to the SQL implementation, I'm wondering if you can talk a bit about how it's implemented in your overall architecture, and some of the ways that the implementation of your system has changed since the last time we spoke.
Yoni Eini
0:36:49
So the way we did SQL, in the end you can think of it as a bottom-up rather than a top-down approach. It's not like we said, well, okay, we want SQL, so let's just add SQL and see what works. It was more that we had in the back of our minds that SQL is important, but we didn't want to get to a point where we have a partial SQL language; we didn't want to get to a point where, okay, I support SQL, but no joins, or SQL but no group bys, things like that. So from our perspective, it was, okay, first of all, do we actually have full support for the data transformations that you'd expect in SQL? We had that. Then it was, okay, how about filtering? That's pretty easy, we had that. How about joins? That one's pretty hard. And the underlying capabilities were actually already there as far as joining. The way a join works in Upsolver is that you build an index of the right-hand side, and that index is sitting in S3. So it's still in the data lake, but it's conceptually a key-value store, equivalent to what an index would be in a database, and then when you do a lookup from it, when you do the join, that join is going to perform very fast. So that general capability was already there; at the time we called them indexes, now they're called lookup tables. But adding SQL really forced us to understand exactly what's necessary as far as the definition, and boil it down to the basics. I can say SELECT FROM stream LEFT JOIN this lookup table ON key, and that's it, and it'll work, and it'll do exactly what people expect it to do. So I think it forced us to be a lot more concise about how you define these things. And then group bys, also. Group by is actually a pretty interesting semantic on a data lake, because in a database, when you run a group by, it runs on the entire query. If I have a database with a billion records and I say group by key, get me the max value, I'm not going to get the max value of the key from the last minute or something like that; I'm going to get a full SQL statement that just returns, for that key, the max value over everything. I might have indices behind the scenes that help me resolve that, but in the end the statement runs on all the data. And in a data lake, there isn't really that concept. When I run SQL in an ETL, I'm always appending more data, so what am I grouping by? What time window is it? And then I'm going to have duplicate keys of the group by, which is kind of weird when I run my queries on top of the data lake. So I think that was for us the most challenging part: adding this kind of replacing functionality. We support both, because of course a lot of people are used to streaming data and they want the append functionality: I say group by country, tell me the number of users, and every minute I just want to get however many users there were in each country, append it, and do an additional group by in my database, my data lake layer. But the capability to say SELECT country, COUNT(DISTINCT users) FROM my stream GROUP BY country,
and what I actually want is that the result is going to be a table that has one row for each country, with the count of distinct users in that country, and every minute I want it to update so it has the new distinct counts, but that's all I want in the target: that was something that was conceptually super difficult to accomplish. And to get SQL working, to get SQL out, from my perspective we really needed to have that capability. So I think that was probably the biggest challenge from our perspective: implementing a real group by on top of a data lake.
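A rough sketch of the two statements Yoni describes, in generic SQL with hypothetical stream and table names; the exact Upsolver keywords for lookup tables and output modes are omitted, only the shape of the queries is shown.

    -- Illustrative only: enrich a stream by joining against a lookup table
    -- (an index materialized in S3, per the discussion above).
    SELECT e.user_id,
           e.event_time,
           u.country
    FROM clickstream e
    LEFT JOIN users_lookup u   -- "lookup table": a key-value index kept in the lake
      ON e.user_id = u.user_id;

    -- The "replacing" aggregation: the target should hold one up-to-date row per
    -- country, refreshed as new events arrive, rather than an ever-growing append log.
    SELECT country,
           COUNT(DISTINCT user_id) AS distinct_users
    FROM clickstream
    GROUP BY country;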
Tobias Macey
0:40:42
And so in terms of when you're onboarding customers, particularly now that you have this SQL layer for being able to empower the DBA to be the owner of the data lake, what are some of the main concepts that you generally need to educate your customers on?
Ori Rafael
0:41:00
The main two things are the relational model, which we touched on briefly before, about not creating a relational model in the end but actually creating the views they want to query. And the other thing is to think of your data as an event stream. We said that Upsolver's approach is Kappa, not Lambda, and we want to do everything in stream processing. But the way you need to think about it is that you're looking at the context of an event, and you want to enrich that event with historical data; you're not joining static data sets. So everything has a time-based filter, a time-based way of thinking about it. Everything is additive processing, not standard batch. A lot of people are used to batch, and sometimes it takes some time until they make the leap forward to stream processing. My experience has been that once users do it, they really can't go back, but it takes them time to do
Tobias Macey
0:42:00
it. And what are some of the features of your platform or capabilities or ways of using it that are either often overlooked or underutilized by your customers that you think they would benefit from using more frequently?
Ori Rafael
0:42:16
I think that since we met, and even since you interviewed Yoni, even before that, we were very focused on analytics use cases, where the person in the end was querying the data. But Upsolver is also a really good solution for doing machine learning, like streaming machine learning. And we
Tobias Macey
0:42:33
have a few customers that are using it and are very happy with it, but we haven't really spent a lot of time educating our users on how to build a data set from a stream in a way that lets you actually productionize the models you create afterwards. So I think that part is kind of overlooked, and it's something that we plan to change going forward. With the streaming capabilities and being able to run machine learning models on your platform, what have you found to be the adoption or viability of being able to do something like reinforcement learning, which requires that real-time feedback loop to be able to update the model and update its behavior in real time?
Yoni Eini
0:43:12
Yeah, I mean, exactly.
0:43:15
That exactly hits the nail on the head. That's the kind of thing that's super easy to do with Upsolver, kind of ridiculously easy; you don't even think about it, it just happens. Whereas people think of it as something they might put on their roadmap as an epic five-year project.
Ori Rafael
0:43:33
But let's be specific, Yoni. The very specific thing is the fact that you are doing everything with streaming, and the fact that you protect the user from taking in data from the future and creating leakage in their model. The fact that we are doing everything in stream also means that you're not calculating your data set in one way and then calculating the features for serving in a different way. You have just one way to create the features, so you don't need to go searching for bugs and asking what you did differently between your batch process and your stream process. That's usually the main issue we saw, unless you see something else?
Yoni Eini
0:44:14
You kind of have to, for real-time machine learning, traditionally have the Lambda architecture. It's really going to be: you have Spark for the offline, you have some code written in some language for the online, and these two code bases are never going to do exactly the same thing, and they often don't even have access to the same data at the same time. Because they're so different, and because machine learning models are very sensitive to small changes in how the data looks, these projects just don't work in the traditional sense. You kind of have to have the Kappa architecture; you have to have a single way of defining it and accessing the historical data, and it has to work the same on your offline data and on your online data. It has to work exactly the same, otherwise it's just not going to work. And that complexity just goes away, because that's just how Upsolver
Tobias Macey
0:45:06
does it. In terms of your experience of building the Upsolver platform and democratizing it for the DBA to be able to handle the data lake, and just growing the overall business and technical elements of the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process, both in bringing it to the DBA and overall in building Upsolver?
Ori Rafael
0:45:27
I think you're probably asking from both a business and a technological perspective, right? For me, when we first started we were always trying to build the best product, to create the best engineering solution to every problem. And then you find out that the best solution is not necessarily a familiar solution. It's hard to keep educating users all the time; it creates friction in both the sales process and the implementation process. So we changed our state of mind to look for the familiar. The addition of SQL to the platform came after we had made many mistakes in which we didn't create an experience that was familiar enough for the users. We did a few POCs where we felt we provided a good user experience, and then an analyst would come in and say, "Hey, I don't care about this UI that you have built. I know SQL, give me SQL, that's what I understand." Once we added that, we found customers doing a migration from a data warehouse to a data lake actually taking their SQL from the data warehouse, copy-pasting it into Upsolver, and starting to tweak it. That blew my mind. I never imagined that once you gave them the familiar option, they would start using the system in ways I didn't think about. You can take that to other features as well, like the fact that Upsolver is priced per compute. When we were just getting started we tried to price by volume of data, but data processing solutions are usually priced by compute. It's not that compute pricing is much easier to understand, it's that everyone already understands it. The same goes for deployment: the way we deploy today is we give you a CloudFormation script, because that's how people deploy software on AWS. So putting an emphasis on the familiar was maybe my biggest takeaway from the last couple of years.
Yoni Eini
0:47:24
Yeah, I think also, you have a ton of different features, and it's very hard to explain them concisely. There's so much complexity there, so much going on, so many different capabilities. And then when you tell someone, "well, it supports SQL," that's really packing a lot of complexity into a very short statement, because they already know what to expect.
Tobias Macey
0:47:47
And as you plan for the future of the business and look to the current trends in the industry for data lake technologies and usage of data lakes, what do you have planned for your roadmap?
Yoni Eini
0:47:59
I'd say definitely everything around portability. Today we're really AWS focused, but we don't want to be exclusively for AWS customers. We want anyone who has a data lake, or wants a data lake, to be able to do it, whether they're on premises, in Azure, or wherever it is. So a big thing is just making sure that everyone can have access. Data lakes also have a unique challenge, and I'm not sure how much it's affecting data lakes today, but I think it's definitely going to be more and more of an issue going forward, which is GDPR compliance. Data lakes are very good for storing large amounts of data, but they're very unwieldy in the sense that if you want to delete something, it can be almost impossible. Maybe today many organizations are just giving up and saying, "Well, okay, it's in the data lake, maybe that's okay, I hope nobody sees it," but that's not going to fly much longer. We actually have some pretty interesting solutions to these problems as far as data lake management goes, and since Upsolver is a data lake management platform, it's kind of trivial that we would be the ones to enable these features. I think that's going to be super important going forward. They're features that exist today, but they're not wrapped in a way where, okay, you have a GDPR compliance button, you can just click it and delete a user's PII. So that's something we're definitely looking at focusing on. And then maybe the last thing I would say is focusing on educating users: everything from documentation, to tutorials, to webinars, to workshops and training sessions, everything you can think of. Part of it is training on Upsolver, but part of it is a lot of the stuff we talked about today, like how to think about streaming data, or how to think about your data as streaming data. So another big focus of ours is just to make sure that people can get access to it. Part of that is making sure the platform is as easy and familiar as possible; part of it is making sure there's a lot of information around it so they can figure out what's going on.
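To make the GDPR point concrete: deleting one user from a data lake typically means rewriting every file that contains their records. A minimal, hypothetical sketch (not Upsolver's implementation; bucket, prefix, and column names are assumptions) of doing that for Parquet objects on S3 might look like this:

```python
# Minimal sketch of a "delete this user" pass over a data lake: every Parquet
# object under a prefix is downloaded, filtered, and rewritten without the
# offending rows. A real pipeline would also handle partitions, manifests,
# compaction, and retries.
import io

import boto3
import pyarrow.compute as pc
import pyarrow.parquet as pq


def delete_user(bucket: str, prefix: str, user_id: str) -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            table = pq.read_table(io.BytesIO(body))
            filtered = table.filter(pc.not_equal(table["user_id"], user_id))
            if filtered.num_rows == table.num_rows:
                continue  # this file never mentioned the user
            buf = io.BytesIO()
            pq.write_table(filtered, buf)
            s3.put_object(Bucket=bucket, Key=obj["Key"], Body=buf.getvalue())
```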
Tobias Macey
0:50:15
All right, are there any other aspects of the work that you've done on Upsolver, or of using SQL as the interface for data lakes, or just overall data lake technologies and usage, that we didn't discuss that you'd like to cover before we close out the show?
Yoni Eini
0:50:26
I think we touched on most of the points. I'd just say that I'm super excited about what's going on now. A year or two ago, data lakes were on everyone's lips, and today data lakes are in everyone's AWS account. That's already a pretty huge thing, but I don't think people understand the magnitude of the shift that's going to happen. Today, databases probably still account for more data, or maybe not, maybe at this point data lakes hold more data than databases. But as data lakes grow exponentially and databases grow linearly, really all of the world's data is going to be in the data lake, all of it; anything else is going to be a rounding error. So having all of the different capabilities that you expect on top of the data lake has to happen, because the storage is going to be in the data lake; it's just so much more competitive from a cost perspective. I think this is a really exciting time to be talking about this stuff.
Tobias Macey
0:51:30
All right. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Ori Rafael
0:51:46
I think you're kind of playing both extremes. If you're talking about data management, you either have the database option, which gives you the ease-of-use advantages but has all the pitfalls when it comes to scale and the ability to use additional systems, or you have the data lake, which is still very, very complex. Spark replaced Hadoop, but it's still as complex as Hadoop. That's why Upsolver is very focused on that complexity as the main problem we want to solve. If I would add to that, I would add metadata management. I think the fact that each database manages its own metadata is something that's going to change, since we have so many different databases and query engines, and concepts like the AWS Glue Data Catalog and the Hive metastore. So centralized metadata management, where you're also doing centralized security management, is something that's going to take a much bigger place going forward.
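As a small illustration of the centralized metadata idea, here is a hedged sketch of registering a data lake table in the AWS Glue Data Catalog with boto3, so multiple query engines can share one set of table definitions. Database, table, bucket, and column names are hypothetical placeholders:

```python
# Minimal sketch: register a Parquet table in the Glue Data Catalog so that
# Athena, Spark, Presto, and friends all read the same schema and location.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="analytics_lake",  # hypothetical catalog database
    TableInput={
        "Name": "page_views",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id", "Type": "string"},
                {"Name": "url", "Type": "string"},
                {"Name": "event_time", "Type": "timestamp"},
            ],
            "Location": "s3://example-data-lake/page_views/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```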
Yoni Eini
0:52:48
Yeah, and I'd say from my perspective that today data lakes are really only addressing a small portion of the actual business use cases. Often they're going to be an interim step, or maybe an end target for long-term storage or something, but today the users who are getting a lot of value out of the data lake tend to be data analysts. Of course that's a huge market, and it's our bread and butter today, but going forward there will be more use cases. That's something I don't see today: you don't see many people saying data lake for OLTP, or data lake for stream processing, for that matter. I think these things should happen. For stream processing, in a way, Kafka is saying that now with their S3 extension. They're kind of saying, well, okay, why do you need a giant Kafka cluster when three nodes can deal with two million events per second? The reason it's big is all the historical data. So keep a tiny bit of the data in the live Kafka cluster and put everything else on S3, and if that's seamless, wow, suddenly it's so much cheaper and better. So I think that adding more use cases on top of the data lake is something that is really nascent, really just starting, and there are a lot of new, interesting things that can happen there.
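To sketch the "small hot Kafka plus cheap S3 history" pattern Yoni mentions (a conceptual illustration only, not Confluent's or Upsolver's implementation; topic, bucket, and prefix names are hypothetical), a consumer can stitch archived events on S3 together with the live tail still retained in Kafka:

```python
# Conceptual sketch: read historical events from Parquet files offloaded to
# S3, then continue with recent events from a small Kafka cluster, behind one
# iterator interface.
import io
import json

import boto3
import pyarrow.parquet as pq
from kafka import KafkaConsumer  # kafka-python


def read_historical_events(bucket: str, prefix: str):
    """Yield archived events from Parquet files on S3."""
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for row in pq.read_table(io.BytesIO(body)).to_pylist():
            yield row


def read_live_events(topic: str, bootstrap: str):
    """Yield recent events still retained in the (small) Kafka cluster."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        yield message.value


def read_all_events():
    # History first (cheap object storage), then the live tail.
    yield from read_historical_events("example-events-archive", "page_views/")
    yield from read_live_events("page_views", "localhost:9092")
```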
Tobias Macey
0:54:14
One thing as well that I see as being a big gap in the space is around being able to test and validate your ETL logic and run it through a CI/CD pipeline, to ensure that you're not injecting errors into your overall transformations, and being able to do a before-and-after comparison to confirm that the work you're doing is what you anticipated and what you actually wanted, given the real-world data.
Yoni Eini
0:54:40
Yeah, I totally agree. I think that's probably one of the biggest focuses in our system: previewing the data as much as possible and allowing you to iterate very quickly. I 100% agree with you there, especially because actually running an ETL process is very expensive; there's a huge cost associated with getting it wrong the first time.
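To make the testing idea in this exchange concrete, here is a minimal pytest-style sketch (transformation, fixture paths, and column names are all hypothetical) of validating ETL logic in CI by comparing the output on a sample dataset against a checked-in "golden" result:

```python
# Minimal sketch: run the transformation on a small, version-controlled
# sample of real-world-shaped data and diff the result against an expected
# golden output, so unintended logic changes fail the CI build.
import pandas as pd


def transform(events: pd.DataFrame) -> pd.DataFrame:
    """Example transformation under test: events per user per day."""
    events["event_date"] = pd.to_datetime(events["event_time"]).dt.date
    return (
        events.groupby(["user_id", "event_date"])
        .size()
        .reset_index(name="event_count")
    )


def test_transform_matches_golden_output():
    sample = pd.read_csv("tests/fixtures/sample_events.csv")
    expected = pd.read_csv("tests/fixtures/expected_daily_counts.csv")
    expected["event_date"] = pd.to_datetime(expected["event_date"]).dt.date

    result = transform(sample)

    # Before/after comparison: any change to the logic shows up as a diff
    # against the golden file checked into the repository.
    pd.testing.assert_frame_equal(
        result.sort_values(["user_id", "event_date"]).reset_index(drop=True),
        expected.sort_values(["user_id", "event_date"]).reset_index(drop=True),
    )
```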
Tobias Macey
0:55:04
Yeah, that's probably an entire episode on its own. So with that, I'd like to thank you both for taking the time today to join me and discuss the work that you've been doing with Upsolver to make the data lake a manageable and enjoyable experience. I appreciate the work that you're doing on that front, and I hope you enjoy the rest of your day.
Ori Rafael
0:55:22
Thank you very much, you too, and thanks for having us.
Tobias Macey
0:55:30
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!