Let Your Analysts Build A Data Lakehouse With Cuelake

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements?

Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data?

Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori, that's S-A-T-O-R-I, today and get a $5,000 credit for your next Satori subscription.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today, that's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macey, and today I'm interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL. So, Vikrant, can you start by introducing yourself?

My name is Vikrant. I head engineering at Cuebook. Cuebook is a business analytics startup, and I'm very excited to be here.

And do you remember how you first got involved in data management?

So Cuebook is a business analytics startup, and we store our data in Apache Druid. A lot of our clients are not experienced in building data pipelines for Apache Druid, so that's where we have to come in and do the data engineering for them as well. We have to read data from different data sources, do some transforms, and then ingest it into Druid. We came into data management because of clients, to help them with the data engineering.

Can you describe a bit about what it is that you're building at Cuebook and what it is that inspired you to build the Cuelake project, and just some of the story behind that overall effort?

So we are building an analytics tool at Cuebook, and it's powered by Apache Druid.

Apache Druid gives us very good performance for OLAP queries, so that's why we stuck with Apache Druid. If we used any other data warehousing solution like Snowflake, then we wouldn't have to worry about the data engineering part, because most of our clients have already figured that out. But since we are bound to Apache Druid, we have to do the data engineering as well, and that's where the motivation to build Cuelake came from. Also, we wanted to make it very simple, since currently all the data engineering tools are very complex and cannot be used by an analyst. We wanted an analyst to be able to ingest data into Druid. That's the real motivation.

In terms of the Cuelake project itself, there are a number of other platforms and systems available for using SQL to process, analyze, or transform the data that lives in a data lake storage layer. I'm wondering what you saw as either lacking in those systems or too complex in the available options that made you feel it was useful or necessary to build Cuelake as an alternative to them?

I would say complexity was the only thing that motivated us to build Cuelake.

We used Databricks, we used Airflow, and we have used Dremio for transforming the data. But these tools were not that simple. For example, in Databricks, you have to set up the Spark context; you have to do a lot of DevOps work before you start writing SQL. Even Databricks is not that SQL friendly: first you have to write some code to get the Spark context, and then you can start writing SQL. We wanted to make it so simple that you just create a new notebook and start writing SQL, and that's it. You are running a distributed data system.

In terms of the end users of that project, I'm wondering how that informed your choices of technology, the method of integration, and the overall interface of how you presented the Cuelake project to make it accessible to those end users?

So Cuelake currently can be installed on any Kubernetes cluster. It's at a very nascent stage; we are still building it up. There are still a lot of bugs and a lot of things to work on.

Let me tell you about the choices that we made in building Cuelake. We used Zeppelin notebooks instead of the widely popular Jupyter notebooks, because Zeppelin had good integration with Spark. Currently, Cuelake supports only Spark 3.0.2. We tried 3.1, but support for Spark 3.1 via Zeppelin is not there yet; people are working on it. Zeppelin can create Spark interpreters on demand, when you need them, and it has an inactivity timeout as well. So when you're not using it, it will delete the cluster, saving you the infra cost. That's what we needed: we wanted to use as little infra as possible.

With the disposable infrastructure and the ability to spin up the cluster, execute the SQL, and tear it down all as one smooth operation, I'm curious what were some of the edge cases that you ran into or some of the complexities that you had to engineer around to make that work reliably, where you didn't have the situation of somebody writing their SQL, trying to execute it, and then having no cluster, or running into split-brain situations, or managing some of the scaling aspects to make sure that you're able to provide an interactive enough experience, especially if people have large volumes of data that they're trying to analyze?

So Zeppelin had a default connection timeout of one minute. When you execute some code in your notebook, it will try to spin up a new interpreter, and that new interpreter should come up within a minute. If it does not come up within a minute, then it will time out and the queries will fail. So it was not that difficult, but understanding the Zeppelin code and building Zeppelin from source is a difficult task. That's why Zeppelin is not used that much. Zeppelin has a lot more features than Jupyter; you can write multiple languages in different paragraphs.
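The two timeouts mentioned in this part of the conversation (a one-minute wait for a new interpreter to come up, and an inactivity timeout that tears an idle one down) can be sketched in plain Python. This is only an illustration under stated assumptions: `wait_until_ready` and `IdleTracker` are hypothetical names, the idle threshold is a placeholder, and none of this is Zeppelin's or Cuelake's actual code.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 60.0,
                     poll_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the interpreter came up in time, False on timeout
    (the case in which the notebook query would fail).
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


class IdleTracker:
    """Remembers the last time the interpreter was used (hypothetical helper)."""

    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_used = clock()

    def touch(self) -> None:
        # Call this whenever a paragraph is executed.
        self._last_used = self._clock()

    def should_shut_down(self) -> bool:
        # True once nothing has run for idle_timeout_s seconds.
        return self._clock() - self._last_used >= self.idle_timeout_s
```

The clock and sleep functions are injected so that the behaviour can be checked without actually waiting a minute; a slow cluster start simply shows up as `wait_until_ready` returning `False`.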

And as far as Zeppelin being the interface, and your targeting of analysts as some of the end users, I'm wondering if you ran into any issues with people who maybe had experience with Jupyter needing to remap their experience to work with Zeppelin, or if the overall paradigm is similar enough that people are able to quickly get up and running with it?

I believe it's similar enough, but that's from my limited knowledge; I have not worked with Jupyter that much. In my opinion, there should not be a transition gap if you're coming from Jupyter and moving to Zeppelin. It should be straightforward.

Digging a bit more into the design and architecture of Cuelake, you mentioned that it is built to run on Kubernetes, and I know that sometimes dealing with Spark in a Kubernetes context can be complicated. You have to build some additional niceties into the container image to make it come up cleanly and work well. I'm wondering if you can talk through a bit more of engineering Cuelake to be cloud native and work effectively in Kubernetes clusters, especially given the potential variability in how Kubernetes might be configured or the versioning of Kubernetes, and just some of those aspects of trying to build a system that is intended for end users and is easy to operate while dealing with all these potential variances?

So we have tested Cuelake on AWS, GCP, and Azure. As for Kubernetes versions, I guess it should be fine because it's very simple. There is just one RBAC role, which covers creating and deleting all these resources, reading config maps, and reading secrets. We did face a lot of challenges in running Spark on Kubernetes, but we kept going, and we figured out all the variables that are required and bundled them together. So when you spin up a Cuelake instance, all those variables are already set, and you just have to start writing SQL.

It's still in development. We are working on a workspace feature. Cuelake can be installed on any Kubernetes cluster, and you can choose a storage provider: GCS, S3, or Azure storage. You give the location and the credentials, and Cuelake will install a Hive metastore for you. Then you can just start writing SQL. You can see the Hive metastore tables via our UI.

In terms of your experience of working with these technologies and building a SQL-first workflow on top of a data lake or data lakehouse architecture, I'm wondering how that compares to your experiences of working with some of these dedicated cloud data warehouses that might be a more cohesive experience but have their own set of limitations, and just what you see as being the overall trade-offs and some of the motivation for building toward this lakehouse approach versus just using a cloud warehouse?

So if we compare on performance and the features they provide, a cloud data warehouse is definitely way ahead of a data lake or a lakehouse. The only place where a data lake or lakehouse wins is price. There is a lot of data that you don't want to actively use and just want to have stored somewhere. If you have that kind of data and you don't want to pay a lot of money for it, then you can have a data lake or lakehouse.

Can you give a bit more of an overview of what you're building at Cuebook and some of the ways that your focus on applying machine learning to the problem of business metrics has benefited from using this data lake architecture for processing the data, and some of the ways that you're leveraging Druid to help with the operational and analytical aspects of providing this feedback to your end users?

With Druid being a time series database, it's very easy to run anomaly detection and forecasts on data that we fetch from Druid. We use an open source library by Facebook called Prophet to do the anomaly detection and forecasts.
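The idea behind forecast-based anomaly detection can be sketched without any dependencies: predict an expected band for each point from recent history and flag values that fall outside it. The sketch below is a deliberately simple stand-in (trailing mean plus or minus `k` standard deviations), not Prophet and not Cuebook's pipeline; `flag_anomalies`, `window`, `k`, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=7, k=3.0):
    """Return indices whose value lies outside mean +/- k*stdev of the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)
        spread = stdev(history)
        lower, upper = center - k * spread, center + k * spread
        if not (lower <= values[i] <= upper):
            anomalies.append(i)
    return anomalies


# Made-up daily order counts for one region, with a sudden drop on day 8.
daily_orders = [100, 102, 98, 101, 99, 103, 100, 97, 40, 101]
print(flag_anomalies(daily_orders))  # [8]
```

A library like Prophet replaces the trailing mean with a fitted model that accounts for trend and seasonality, but the alerting step is the same comparison of an observed value against a predicted interval.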

So Cuelake helps us bring data into Druid. For the rest, we have other Python workflows that run on top of Druid.

And so in terms of the types of analytics and anomaly detection that you're doing for your end users, I'm wondering if you can just talk about some of the types of business metrics that you are monitoring and some of the potential value and utility that users are able to get from identifying when there is variance in the metrics that they're tracking for their business?

So mostly, these are operational metrics: for example, the number of website visitors or the orders from different regions. Suppose that orders from a region are dropping. You'll get an anomaly alert, and you can get into the root cause analysis of why the orders are dropping. There's also a funnel where you can see the top-level metric that affects the orders in that region. So it's a very easy way to do root cause analysis on your data. Once you get the anomaly, you can click on it and you'll get a detailed descriptive page that shows you what the reasons could be, and then you can figure it out.

Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo or Facebook ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on.

The data you're looking for is already in your data warehouse and BI tools.

Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems.

No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.

As far as the overall data lake and lakehouse approach, what are some of the shortcomings that you've experienced, where you maybe wish you had more of the capabilities or the experiential polish of the cloud warehouses?

The limitation that we feel most strongly is that when we want to query, we have to spin up a cluster. That takes two to three minutes, and that bothers us a lot. We want our data to be readily queryable. With data warehouses, you can just go and query the data and you get the results instantly. But in Cuelake, you have to wait two or three minutes for the first query to execute. We have thought about how we can solve it, but I think there's no solution: you have to have a cluster running to make the data queryable. We want storage and compute to be totally separate, and the compute should be up instantly, just like Snowflake. In Snowflake, you can switch cluster sizes just by running a query, and it switches instantly. That's not the case with Cuelake, and that's a major limitation.

I'm wondering what you see as some of the scalability or experience benefits of using the notebook interface with Spark as the execution engine versus things like Presto or Trino for doing the query execution on top of object storage, and maybe using something like the Glue metadata catalog or the Hive catalog for managing the table space?

So we are using the Hive catalog for managing the tables' metadata. We started with the AWS Glue catalog, and then we switched to Hive because it was more convenient: when you install Cuelake on a different cloud provider, you will not get AWS Glue. That's what we are working on right now. When you create a workspace, it will create a Hive metastore for you. As for Presto and Trino, I haven't worked with them yet.

And then as far as the actual table layer, I noticed that you're using Iceberg for the table format. I'm wondering what your motivation was for selecting that versus using just the native Hive table format, and some of the specific features that you were looking at that maybe led you to choose that over something like Hudi, for instance.

Hudi was an alternative when we were building it. We went with Iceberg because at the time when we started building it, Iceberg was getting really popular. But as of today, we support both Iceberg and Delta. The code is still in development, but it will be there in the next release. You can have your ACID provider as Iceberg or Delta.
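Table formats such as Iceberg and Delta allow keyed upserts, typically written as a `MERGE INTO` statement in Spark SQL: rows whose key matches an existing row are updated, and the rest are inserted. The semantics can be sketched in plain Python as follows; the `upsert` helper, the `id` key, and the sample rows are assumptions for illustration and say nothing about how either format implements the operation on files.

```python
def upsert(target, updates, key="id"):
    """Merge `updates` into `target` by key.

    Rows with a matching key are replaced; rows with a new key are appended.
    The inputs are left unmodified and a new list is returned.
    """
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row
    return list(merged.values())


orders = [{"id": 1, "status": "placed"}, {"id": 2, "status": "placed"}]
changes = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "placed"}]
result = upsert(orders, changes)
# Order 2 is updated in place and order 3 is added; order 1 is untouched.
```

On a plain Hive table of immutable files, the same change would usually mean rewriting a whole partition, which is why row-level upserts are a common reason to adopt one of these formats.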

And we chose Iceberg and Delta because of our requirements. We wanted to do upserts on the data lake, and these are the best technologies that we have right now: Iceberg, Delta, and Hudi.

Given that you are a small startup and it sounds like you're fairly early in your journey, I'm wondering what your goal was with investing in building Cuelake and then releasing it as open source versus just building something that worked for you and not having to deal with generalizing it and just using it internally?

So that was our goal. We wanted to use it internally, and we open sourced it just to see whether what we think is accepted by other people or not. We got good traction, and we are continuing to build on it.

Have you seen much community engagement and people contributing back to it? Or has it mostly just been people who are installing it and testing it out and then giving some feedback about what they found useful or not?

We are not getting that much feedback. For feedback, we have to go to people and ask them what's missing or what they want included in Cuelake so that they can use it in the future. It's a complicated product, and it can only be installed on Kubernetes. If it were a simple product, then everyone could use it. That's why we're not getting that much feedback. And we do not know whether anyone is using it or not; we haven't installed any analytics script or anything like that.

And you mentioned that it started off as an internal project. So was there any sort of unwinding of internal assumptions that you had to do to be able to release it as open source, or anything that you had to segment out to then apply back at deploy time versus just having it hard coded into the system?

So if you look at the earlier versions of Cuelake, there is some code that is used specifically by Cuebook, and it's still there. We are building it primarily for ourselves, and in our spare time we focus on generalizing it.

And in terms of your experience of using Cuelake and releasing it as open source, what are some of the most interesting or innovative or unexpected ways that you've seen it used, or benefits that you've been able to realize after having built it?

We didn't get any feedback that it's being used. Other people in this industry have seen it, and they thought it's pretty cool that it can create a cluster and that you can modify the cluster size on the fly. Suppose that your query is not running at this cluster size; you can just go ahead and modify it, and in two or three minutes you'll know whether the query runs or not. So we got interesting feedback, but I don't think anyone is using it. At least we don't know about it.

In terms of the actual effort to build Cuelake, I'm wondering what were some of the initial ideas that you had about what you wanted to do with it, or some of the assumptions about how it might work or how it might be used internally at Cuebook. What are some of the things that you ended up having to reimagine, or ways that those assumptions have been challenged as you began to work through the process of actually building the system, integrating it, and getting it deployed and operational for your purposes?

So when we started building it, we were very new to these data engineering technologies. We had AWS Glue as our catalog, and we did the authentication via AWS IAM roles. So we thought it was going to be stuck in AWS and be an AWS-only product, installable only on AWS. But as we kept developing it, we got to know that there's the Hive catalog as well, and the Hive catalog was more flexible than AWS Glue. So we are moving towards generalizing it. Initially, we didn't think that we could directly write Spark SQL. That was a good thing that came out of it; we didn't plan it. We thought that users would write code in Python, Java, or Scala. But when we built it, we realized that users can just write SQL.

In terms of your experience of building the Cuelake project, particularly as somebody who's fairly new to the data engineering ecosystem, what are some of the concepts that you ran into that were either challenging to learn or that you found particularly beneficial as you were building up the business and building the project?

The most challenging thing was learning about all of these jars. For each specific task, you have to install one additional jar, and that jar should be compatible with all the other jars. This was the most challenging thing, and we have to do it for Spark and for Zeppelin, because all of these are Java projects. Once we had a fair bit of understanding, it was easy. But I think people who are new to data engineering will face a challenge, and that's why we are building Cuelake: so that it is simple. There is one complication, which is that you need a Kubernetes cluster, but we'll think of something and solve it somehow so that you can use it directly on an EC2 instance.

Another interesting aspect of how you designed Cuelake is that you ended up going straight for Celery for managing the scheduling, and I noticed in the notes that you're looking at bringing in Airflow to add to some of the scheduling capabilities. I'm wondering what some of the benefits are of just using Celery directly versus bringing in the complexity of the entire Airflow stack, and some of the additional capabilities that you're hoping to be able to bring in by adding Airflow to the overall system architecture?

We did not go with Airflow because, again from my limited knowledge, the appeal of Airflow is its ecosystem: there are a lot of sensors and connectors that the Airflow community has. We realized that we don't need that, so why bring in the Airflow complexity? What we needed was just an API so that Zeppelin can run these notebooks. That's why we didn't go with Airflow. Airflow was also very resource intensive; even in its idle state, it consumed a lot of memory, and we wanted to make Cuelake as light as possible. In its current state, Cuelake uses just 500 megabytes of RAM and very minimal CPU, and it can run 500 notebooks in parallel.
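Triggering a notebook run through an API, as described above, can be sketched against Zeppelin's notebook REST API, which exposes a "run all paragraphs" call as a `POST` to `/api/notebook/job/{noteId}`. The snippet below only builds the request and does not send it; the base URL, the note ID, and the helper name `build_run_note_request` are placeholders, and this is not taken from Cuelake's scheduler code.

```python
import urllib.request


def build_run_note_request(base_url: str, note_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Zeppelin's 'run all paragraphs' endpoint."""
    url = f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"
    return urllib.request.Request(url, data=b"", method="POST")


# Placeholder server address and note ID, for illustration only.
req = build_run_note_request("http://localhost:8080", "2ABCDEFGH")
print(req.get_method(), req.full_url)
```

A scheduler worker (for example a Celery task) would pass such a request to `urllib.request.urlopen` and then poll the note's job status, which keeps the scheduling layer itself very small.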

For people who are interested in being able to do SQL-only pipeline development for their data lakes and data lakehouses, what are the cases where Cuelake is the wrong choice and somebody might be better served with a different solution?

I think it's the wrong choice in its current state because it's not production ready. There is no authentication, so no one will use it in its current state. There are also some bugs here and there that we are aware of, and we are working on solving them. Once we have authentication and a little more reliability, Cuelake would be the wrong choice only if you don't want to manage a Kubernetes cluster. Otherwise, I think it will have all the benefits that other tools offer, with much more simplicity.

As you continue to use Cuelake and extend it, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited to work on?

So there is a client project that we are going to work on using Cuelake. But that's like all our other clients: the data size is a little higher, but it's not challenging. Where I see Cuelake being used is deploying ML pipelines. It has a Python interpreter, so you can write Python code, build your own Docker images, and set those images in Cuelake. I think there is a use case for MLOps here. You can train your model inside Cuelake using Zeppelin notebooks, and when your model is ready, you can leverage the distributed computing so that your model can handle data at scale.

Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll ask the final question: from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?

So the biggest gap, I think, is that these tools are not simple enough. If you want to build a WordPress site, there are a lot of tools available. You just do one click and, boom, you're going. You can start creating posts and articles.

And data engineering should be simpler. Even small startups are producing huge amounts of data, and they need distributed computing that is readily available and easily accessible. They have solutions like Snowflake, BigQuery, and Redshift, and they go for those. That's why these solutions are so popular. Building a data lake or lakehouse is much cheaper than going for a data warehouse, and I believe they are here to coexist: you'll store some data in a data warehouse and some data in a data lake or lakehouse, based on how you're going to query it and how you need it. There's a lot of data that you don't want for analytics, and you won't want to store it in a data warehouse. So the biggest gap is simplicity. These data warehouse technologies are much simpler: BigQuery is simple, and Snowflake is even simpler, in my opinion. But building a data lake or lakehouse is still complicated. Even with tools like Databricks, it requires a lot of DevOps knowledge and data engineering knowledge.

Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Cuelake and how you're using it at Cuebook. It's definitely a very interesting project, and I agree there's a lot of need for more polish and a more unified user experience for building analytics on data lakes. So I appreciate the time and effort that you and your team are putting into that, and I hope you enjoy the rest of your day.

Thanks, Tobias.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
