Summary
Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture, they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project, which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg, he and his team at Cuebook have built an autoscaled, cloud native system that abstracts the underlying complexity.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Cuelake is and the story behind it?
- There are a number of platforms and projects for running SQL workloads and transformations on a data lake. What was lacking in those systems that you are addressing with Cuelake?
- Who are the target users of Cuelake and how has that influenced the features and design of the system?
- Can you describe how Cuelake is implemented?
- What was your selection process for the various components?
- What are some of the sharp edges that you have had to work around when integrating these components?
- What is involved in getting Cuelake deployed?
- How are you using Cuelake in your work at Cuebook?
- Given your focus on machine learning for anomaly detection of business metrics, what are the challenges that you faced in using a data warehouse for those workloads?
- What are the advantages that a data lake/lakehouse architecture maintains over a warehouse?
- What are the shortcomings of the lake/lakehouse approach that are solved by using a warehouse?
- What are the most interesting, innovative, or unexpected ways that you have seen Cuelake used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cuelake?
- When is Cuelake the wrong choice?
- What do you have planned for the future of Cuelake?
Contact Info
- vikrantcue on GitHub
- @vkrntd on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:19] Unknown:
Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access.
Go to dataengineeringpodcast.com/satori, that's S-A-T-O-R-I, today and get a $5,000 credit for your next Satori subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today, that's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey, and today I'm interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL. So, Vikrant, can you start by introducing yourself? My name is Vikrant. I head engineering at Cuebook. Cuebook is a business analytics startup,
[00:02:04] Unknown:
and I'm very excited to be here. And do you remember how you first got involved in data management? So Cuebook is a business analytics startup, and we store our data in Apache Druid. A lot of our clients are not experienced in building data pipelines for Apache Druid, so that's where we have to come in and do the data engineering for them as well. We have to read data from different data sources, do some transforms, and then ingest the data into Druid. So we came into data management through our clients, to help them with the data engineering. Can you describe a bit about what it is that you're building at Cuebook and what it is that inspired you to build the Cuelake project and just some of the story behind that overall effort? So we are building an analytics tool at Cuebook, and it's powered by Apache Druid. Apache Druid gives us very good performance for OLAP queries.
That's why we stuck with Apache Druid. If we used any other data warehousing solution, like Snowflake, then we wouldn't have to worry about the data engineering part, because most of our clients have already figured that out. But since we are tied to Apache Druid, we have to do the data engineering as well. And that's where the motivation to build Cuelake came from. Also, we wanted to make it very simple, since currently all the data engineering tools are very complex and cannot be used by an analyst. We wanted an analyst to be able to ingest data into Druid. That's the real motivation.
[00:03:40] Unknown:
In terms of the Cuelake project itself, there are a number of other platforms and systems available for being able to use SQL for processing or analyzing or transforming the data that lives in a data lake storage layer. I'm wondering what you saw as either being lacking in those systems or too complex in the available options that made you feel it was useful or necessary to build Cuelake as an alternative to them? I would say complexity
[00:04:12] Unknown:
was the only thing that motivated us to build Cuelake. We used Databricks, we used Airflow, and we have used Dremio for transforming the data. But these tools were not that simple. For example, in Databricks, you have to set up the Spark context. You have to do a lot of DevOps work before you can start writing SQL. Even Databricks is not that SQL friendly: first, you have to write some code to get the Spark context, and then you can start writing SQL. So we wanted to make it so simple that you just create a new notebook and start writing SQL, and that's it, you are running a distributed data system.
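For context, the ceremony being described looks roughly like the sketch below: in a generic Spark notebook you first obtain a SparkSession in code, and only then can you run SQL, whereas the idea behind Cuelake is that a notebook paragraph is just the SQL itself. The table and query here are invented for illustration.

```python
# A rough sketch of the boilerplate described above, not Cuelake's own code:
# in a plain Spark notebook you build or fetch a session first, and only then
# can you run SQL. The table and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyst-notebook").getOrCreate()

# Only after the session exists can the analyst's actual query run.
spark.sql("SELECT region, count(*) AS orders FROM sales GROUP BY region").show()
```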
[00:05:00] Unknown:
In terms of the end users of that project, I'm wondering how that informed your choices of technology and the method of integration and just the overall interface of how you presented the Cuelake project to make it accessible to those end users?
[00:05:17] Unknown:
So Cuelake currently can be installed on any Kubernetes cluster. And it's in a very nascent stage, meaning we are still building it up. There are still a lot of bugs and a lot of things to work on. Let me tell you about the choices that we have made building Cuelake. We have used Zeppelin notebooks instead of the widely popular Jupyter notebooks. We've used Zeppelin because it had good integration with Spark. Currently, Cuelake supports only Spark 3.0.2. We tried 3.1, but support for Spark 3.1 via Zeppelin is not there yet; the Zeppelin folks are working on it. Zeppelin can create Spark interpreters on demand, when you need them, and it has an inactivity timeout as well. So when you're not using it, it will delete the cluster, saving you the infra cost. And that's what we needed. We wanted to use as little
[00:06:14] Unknown:
infra as possible. With the disposable infrastructure and the ability to spin up the cluster and execute the SQL and tear it down all as one smooth operation, I'm curious what were some of the edge cases that you ran into or some of the complexities that you had to engineer around to make that work reliably, where you didn't have the situation of somebody writing their SQL, trying to execute it, and then having no cluster, or running into split-brain situations, or managing some of the scaling aspects to make sure that you're able to provide an interactive enough experience, especially if people have large volumes of data that they're trying to analyze?
[00:06:52] Unknown:
So Zeppelin had a default connection timeout of one minute. When you execute some code in your notebook, it will try to spin up a new interpreter, and that new interpreter should come up within a minute. If it doesn't come up within a minute, it will time out and the queries will fail. So it was not that difficult, but understanding the Zeppelin code and building Zeppelin from source is a difficult task. That's why Zeppelin is not used that much. Zeppelin has a lot more features than Jupyter; you can write multiple languages in different paragraphs.
[00:07:30] Unknown:
And as far as Zeppelin being the interface, and your targeting of analysts as some of the end users, I'm wondering if you ran into any issues of people who maybe had experience with Jupyter and needed to remap their experience to work with Zeppelin, or if the overall paradigm is similar enough that people are able to quickly get up and running with it? I believe it's similar enough, but that's my limited knowledge; I have not worked with Jupyter that much. So in my opinion, it's similar enough, and there should not be a transition gap if you're coming from Jupyter. If you're going to Zeppelin, I think it should be simple enough, should be straightforward. Digging a bit more into the design and architecture of Cuelake, you mentioned that it is built to run on Kubernetes, and I know that sometimes dealing with Spark in a Kubernetes context can be complicated. You have to build in some additional niceties to the container image to make it come up cleanly and work well. I'm just wondering if you can talk through a bit more of engineering Cuelake to be cloud native and work effectively in Kubernetes clusters, especially given the potential variability in how Kubernetes might be configured or the versioning of Kubernetes, and just some of those aspects of trying to build a system that is intended for end users, that is easy to operate, while dealing with all these potential variances?
[00:08:51] Unknown:
So we have tested Cuelake on AWS, GCP, and Azure. As for Kubernetes versions, I guess it should be fine because it's very simple: there's one RBAC role, which handles creating and deleting all these resources, reading config maps, and reading secrets. We did face a lot of challenges in running Spark on Kubernetes. But we kept on going, we figured out all the variables that are required, and we bundled them together. So when you spin up a new instance, all those variables are already set, and you just have to start writing SQL. It's still in development; we are working on a workspace feature. What you can do is select your storage provider. Cuelake can be installed on any Kubernetes cluster.
You can choose a storage provider, GCS, S3, or Azure storage. You give the location and the credentials, and Cuelake will install a Hive metastore for you, and you can just start writing SQL. You can see the Hive metastore tables via the UI.
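As a rough illustration of what that workspace setup hides, the wiring involved is along the lines of the standard Spark, Hadoop, and Hive properties below; the bucket credentials and metastore address are placeholders rather than Cuelake's actual configuration.

```python
# Illustrative only: the kind of configuration a "workspace" takes care of.
# These are standard Spark/Hadoop/Hive properties; the credential and endpoint
# values are placeholders, not Cuelake's actual settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("workspace-sketch")
    # Object storage credentials for the chosen provider (S3 in this sketch).
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    # Point Spark SQL at the Hive metastore that tracks the lakehouse tables.
    .config("spark.sql.catalogImplementation", "hive")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# With the session wired up, the analyst-facing surface is just SQL.
spark.sql("SHOW TABLES").show()
```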
[00:09:54] Unknown:
In terms of your experience of working with these technologies and building a SQL-first workflow on top of a data lake or data lakehouse architecture, I'm wondering how that compares to your experiences of working with some of these dedicated cloud data warehouses that might be a more cohesive experience but have their own set of limitations, and just what you see as being the overall trade-offs and some of the motivation for building toward this lakehouse approach versus just using a cloud warehouse?
[00:10:26] Unknown:
So if we compare on performance and the features they provide, definitely a cloud data warehouse is way ahead of a data lake or a lakehouse. But the one place where a data lake or lakehouse wins is its price. There is a lot of data that you don't want to actively use, and you just want to have it stored somewhere. So if you have that kind of data and you don't want to pay
[00:10:55] Unknown:
a lot of money for it, then you can have a data lake or lakehouse. Can you give a bit more of an overview of what you're building at Cuebook and some of the ways that your focus on applying machine learning to the problem of business metrics has benefited from using this data lake architecture for being able to process the data, and some of the ways that you're leveraging Druid to help with some of the operational and analytical aspects of providing this feedback to your end users? Druid being a time series database,
[00:11:30] Unknown:
it's very easy to run anomaly detection and forecasts on the data that we fetch from Druid. We use an open source library by Facebook called Prophet to do the anomaly detection and forecasts. So Cuelake helps us bring data into Druid, and then for the rest of it, we have other Python workflows that run on top of Druid.
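As a concrete illustration of that kind of Prophet-based check, the sketch below fits a forecast to a metric series and flags points outside the model's uncertainty interval; the data, columns, and interval width are invented and this is not Cuebook's actual pipeline.

```python
# A minimal anomaly-detection sketch with Prophet, assuming the metric series
# has already been pulled out of Druid. Data and thresholds are illustrative.
import pandas as pd
from prophet import Prophet

# Prophet expects a "ds" (timestamp) column and a "y" (metric value) column.
df = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=90, freq="D"),
    "y": [100 + i % 7 for i in range(90)],  # stand-in for a real metric
})

model = Prophet(interval_width=0.95)  # 95% uncertainty band
model.fit(df)

forecast = model.predict(df[["ds"]])
merged = df.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")

# Flag observations that fall outside the forecast's uncertainty interval.
merged["anomaly"] = (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
print(merged[merged["anomaly"]])
```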
[00:11:58] Unknown:
And so in terms of the types of analytics and anomaly detection that you're doing for your end users, I'm wondering if you can just talk about some of the types of business metrics that you are monitoring and some of the potential value and utility that users are able to get from identifying when there is variance in the metrics that they're tracking for their business?
[00:12:18] Unknown:
So mostly these are operational metrics, for example, the number of website visitors, or the orders from different regions. Suppose that orders from a region are dropping. You'll get an anomaly alert, and you can get into the root cause analysis of why the orders are dropping. There's also a funnel where you can see the top-level metrics that affect the orders in that region. So it's a very easy way to do root cause analysis on your data. Once you get the anomaly, you can click on it and you'll have a detailed, descriptive page that will show you what the likely reasons could be, and then you can figure it out from there.
[00:13:07] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. As far as the overall data lake and lakehouse approach, what are some of the shortcomings that you've experienced, where you maybe wish you had more of the capabilities or the experiential and polish aspects of the cloud warehouses?
[00:13:56] Unknown:
The limitation that we feel most strongly is that when we want to query, we have to spin up a cluster. That takes two to three minutes, and that bothers us a lot. We want our data to be readily queryable. With data warehouses, you can just go and query the data and you get the results instantly. But here in Cuelake, you'll have to wait two or three minutes for the first query to execute. We have thought about how we can solve it, but I think there's no solution to it. You have to have a cluster running to make the data queryable. So we want to have the storage and compute totally separate, and the compute should be up instantly,
just like Snowflake. In Snowflake, you can switch cluster sizes with a single query, and it switches instantly. That's not the case with Cuelake, and that's a major limitation.
[00:14:48] Unknown:
I'm wondering what you see as some of the scalability or experience benefits of using the notebook interface with Spark as the execution engine versus things like Presto or Trino for doing the query execution on top of object storage, and maybe using something like the Glue metadata catalog or the Hive catalog for managing the table space?
[00:15:15] Unknown:
So we are using the Hive catalog for managing the tables' metadata. We started with the AWS Glue catalog, and then we switched to Hive because it was more convenient, because when you install Cuelake on a different cloud provider, you won't get AWS Glue. That's what we are working on right now: when you create a workspace, it will create a Hive metastore for you. As for Presto and Trino, I haven't worked with them yet.
[00:15:44] Unknown:
And then as far as the actual table layer, I noticed that you're using Iceberg for the table format. I'm wondering what your motivation was for selecting that versus using just the native Hive table format, and some of the specific features that you were looking at that maybe led you to choosing that over something like Hudi, for instance.
[00:16:05] Unknown:
Hudi was an alternative when we were building it. We went with Iceberg because at the time when we started building it, Iceberg was getting really popular. But as of today, we support both Iceberg and Delta. The code is still in development, but it will be there in the next release: you can choose either Iceberg or Delta as your table format. And we chose Iceberg and Delta because of our requirements. We wanted to do upserts on the data lake, and these are the best technologies that we have right now, Iceberg, Delta, and Hudi.
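For readers unfamiliar with what an upsert on a data lake looks like, the sketch below uses Iceberg's MERGE INTO support in Spark SQL; the catalog, table, and column names are invented, and this is an illustration rather than Cuelake's own code.

```python
# A minimal upsert sketch using Iceberg's MERGE INTO in Spark SQL. Assumes the
# Iceberg Spark runtime JAR is on the classpath; catalog and table names are
# invented for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-upsert-sketch")
    # Enable Iceberg's SQL extensions and register a Hive-backed Iceberg catalog.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .getOrCreate()
)

# Rows in orders_updates replace matching rows in orders, or are inserted.
spark.sql("""
    MERGE INTO lake.sales.orders AS target
    USING lake.sales.orders_updates AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```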
[00:16:47] Unknown:
Given that you are a small startup and it sounds like you're fairly early in your journey, I'm wondering what your goal was with investing in building Cuelake and then releasing it as open source, versus just building something that worked for you and not having to deal with generalizing it, and just using it internally?
[00:16:57] Unknown:
So that was our goal. We wanted to use it internally, and we open sourced it just to see whether what we think is accepted by other people or not. We got good traction, and we are continuing to build on it. Have you seen much community engagement and people contributing
[00:17:16] Unknown:
back to it? Or has it mostly been people who are installing it and testing it out and then giving some feedback about what they found useful or not? We are not getting that much feedback. For feedback, we
[00:17:27] Unknown:
have to go to people and ask them what's missing in this, or what they want included in Cuelake so that they can use it in the future. It's a complicated product, and it can only be installed on Kubernetes; if it were a simple product, then everyone could use it. So that's why we're not getting that much feedback, and we do not know whether anyone is using it or not. We haven't installed any analytics script or anything like that. And you mentioned that it started off as an internal project. So was there any sort of unwinding
[00:17:56] Unknown:
of internal assumptions that you had to do to be able to release it as open source or anything that you had to segment out to then apply back at deploy time versus just having it hard coded into the system?
[00:18:08] Unknown:
So if you look at the earlier versions of Cuelake, there is some code that is used specifically by Cuebook, and it's still there. We are building it primarily for ourselves, and in our spare time, we focus on generalizing it. And in terms
[00:18:27] Unknown:
of your experience of using Cuelake and releasing it as open source, what are some of the most interesting or innovative or unexpected ways that you've seen it used, or benefits that you've been able to realize after having built it? We didn't get any feedback
[00:18:44] Unknown:
that it's being used. Other people that are in this industry have seen it, and they thought it's pretty cool that it can create a cluster and you can modify the cluster size on the fly. Suppose that your query is not running at this cluster size, you can just go ahead and modify it, and in two or three minutes you'll get to know whether the query runs or not. So we got interesting feedback, but I don't think anyone is using it; at least we don't know about it. In terms of the actual
[00:19:13] Unknown:
effort to build Cuelake, I'm wondering what were some of the initial ideas that you had about what you wanted to do with it, or some of the assumptions about how it might work or how it might be used internally at Cuebook. I'm wondering what are some of the things that you ended up having to either reimagine or ways that those assumptions have been challenged as you began to work through the process of actually building the system and integrating it and getting it deployed and operational for your purposes?
[00:19:44] Unknown:
So when we started building it, we were very new to the data engineering technologies. We had AWS Glue as our catalog, and we did the authentication via AWS IAM roles. So we thought it was going to be stuck in AWS only, that it was going to be an AWS-only product that could be installed only on AWS. But as we kept developing it, we got to know that there's the Hive catalog as well, and the Hive catalog was more flexible than AWS Glue. So we are moving towards generalizing it. Initially, we didn't think that we could directly write Spark SQL. That was a good thing that came out of it. We didn't plan it. We thought that users would write code in Python, Java, or Scala. But when we built it, we realized that, okay, users can just write SQL.
[00:20:38] Unknown:
In terms of your experience of building the Cuelake project, particularly as somebody who's fairly new to the data engineering ecosystem, what are some of the concepts that you ran into that were either challenging to learn or that you found particularly beneficial as you were building up the business and building the project?
[00:20:56] Unknown:
The most challenging thing was learning about all of these JARs. I mean, for each specific task, you have to install one additional JAR, and that JAR should be compatible with all the other JARs. This was the most challenging thing, and we had to do it for Spark and for Zeppelin, because all of these are Java projects. But then, once we got a fair bit of understanding, it was easy. People who are new to data engineering, I think, will face a challenge, and that's why we are building Cuelake, so that it is simple. There is one complication, that you need a Kubernetes cluster, but we'll solve it somehow so that you can use it directly on an EC2 instance.
[00:21:40] Unknown:
Another interesting aspect of how you designed Cuelake is that you ended up going straight to Celery for managing the scheduling, and I noticed in the notes that you're looking at bringing in Airflow to add some of the scheduling capabilities. I'm wondering what some of the benefits are of just using Celery directly versus bringing in the complexity of the entire Airflow stack, and some of the additional capabilities that you're hoping to be able to bring in by adding Airflow to the overall system architecture?
[00:22:12] Unknown:
We did not go with Airflow because, and again this is my limited knowledge, maybe naive, Airflow's appeal is its ecosystem: there are a lot of sensors and connectors that the Airflow community has built. We realized that we don't need them, so why bring in the Airflow complexity? What we needed was just an API call so that Zeppelin can run these notebooks, and that's why we didn't go with Airflow. Airflow is also very resource intensive; even in an idle state, it consumes a lot of memory, and we wanted to make Cuelake as light as possible. So in its current state, Cuelake uses just 500 megabytes of RAM and very minimal CPU, and it can run 500 notebooks in parallel.
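To make the "just an API call" point concrete, here is a minimal sketch of that style of scheduling, assuming Zeppelin's REST endpoint for running all paragraphs of a notebook; the broker URL, service name, note ID, and schedule are placeholders rather than Cuelake's actual implementation.

```python
# A lightweight scheduling sketch: a Celery task that asks Zeppelin to run a
# notebook over its REST API on a periodic beat. All URLs, IDs, and schedules
# here are placeholders.
import requests
from celery import Celery
from celery.schedules import crontab

app = Celery("scheduler", broker="redis://localhost:6379/0")

ZEPPELIN_URL = "http://zeppelin:8080"  # assumed in-cluster service address

@app.task
def run_notebook(note_id: str) -> int:
    # Ask Zeppelin to run all paragraphs of the given notebook.
    resp = requests.post(f"{ZEPPELIN_URL}/api/notebook/job/{note_id}")
    resp.raise_for_status()
    return resp.status_code

# Run the ELT notebook hourly with Celery beat instead of a full Airflow stack.
app.conf.beat_schedule = {
    "hourly-elt-notebook": {
        "task": run_notebook.name,
        "schedule": crontab(minute=0),
        "args": ("2HYPOTHETICALID",),  # hypothetical Zeppelin note ID
    }
}
```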
[00:22:58] Unknown:
For people who are interested in being able to do SQL-only pipeline development for their data lakes and data lakehouses, what are the cases where Cuelake is the wrong choice and somebody might be better served with a different solution?
[00:23:12] Unknown:
I think it's the wrong choice in its current state because it's not production ready. There is no authentication, so no one should use it in its current state. There are some bugs here and there that we are aware of and are working on solving. But once we have the authentication capabilities and a little more reliability, then the only case where Cuelake would be the wrong choice is if you don't want to manage a Kubernetes cluster. Otherwise, I think it will have all the benefits that other tools offer, with much more simplicity.
[00:23:49] Unknown:
As you continue to use Cuelake and extend it, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited to work on?
[00:23:59] Unknown:
So there is a client project that we are going to work on using Cuelake, but that's like all other clients; the data size is a little higher, but it's not challenging. What I see as a place where Cuelake can be used is deploying ML pipelines. It has a Python interpreter, so you can write Python code, and you can build your own Docker images and set those images in Cuelake. I think there is a use case for MLOps here. You can train your model inside Cuelake using Zeppelin notebooks, and when your model is ready, you can leverage this distributed computing so that your model can handle data at scale as well.
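As an illustration of the pattern being described, the sketch below trains a small model in a notebook and then scores a larger table with Spark by broadcasting the fitted model into a pandas UDF; the tables, features, and model choice are invented, and this is a sketch of the idea rather than a supported Cuelake workflow.

```python
# Hypothetical MLOps sketch: train on a small sample in one notebook paragraph,
# then let Spark score a much larger table. Table, column, and model choices
# are illustrative only.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("mlops-sketch").getOrCreate()

# Train on a driver-side sample pulled from the lakehouse.
train = spark.table("metrics.training_sample").toPandas()
model = LogisticRegression().fit(train[["visits", "orders"]].values, train["label"])

# Broadcast the fitted model so every executor can score its own partitions.
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def score(visits: pd.Series, orders: pd.Series) -> pd.Series:
    features = pd.concat([visits, orders], axis=1).values
    return pd.Series(bc_model.value.predict_proba(features)[:, 1])

scored = spark.table("metrics.events").withColumn("score", score("visits", "orders"))
scored.write.mode("overwrite").saveAsTable("metrics.scored_events")
```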
[00:24:40] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll ask the final question: from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today? So the biggest gap, I think, is that all of these tools are not simple enough. If you want to build a WordPress site, there are a lot of tools available. You just do one click and boom, you're going; you can start creating posts and articles.
[00:25:16] Unknown:
And data engineering should be simpler. Even small startups are producing huge amounts of data, and they need distributed computing readily available and easily accessible. They have solutions like Snowflake, BigQuery, and Redshift, and they go for those; that's why these solutions are so popular. But building a data lake or lakehouse is much cheaper than going for a data warehouse, and I believe they are here to coexist: you'll store some data in a data warehouse, and you'll store some data in a data lake or lakehouse, based on how you're going to use it and how often you need it. There's a lot of data that you don't want for analytics.
You won't want to store it in a data warehouse. So the biggest gap is simplicity. These data warehouse technologies are much simpler; BigQuery is simple, Snowflake even simpler, in my opinion. But building a data lake or lakehouse is still complicated. Even with tools like Databricks, it requires a lot of DevOps knowledge and data engineering knowledge.
[00:26:22] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Cuelake and how you're using it at Cuebook. It's definitely a very interesting project, and one where I agree there's a lot of need for more polish and a more unified user experience for building analytics on data lakes. So I appreciate the time and effort that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview
Interview with Vikrant Dubey: Introduction and Background
Building Cuebook and the Cuelake Project
Motivation Behind Cuelake
Technical Choices and Challenges
Design and Architecture of Cuelake
Comparing Data Lakehouse and Cloud Data Warehouses
Business Metrics and Anomaly Detection
Shortcomings of Data Lakehouse Approach
Technical Details: Hive Catalog and Table Formats
Open Sourcing Cuelake
Challenges and Learnings
Simplifying Data Engineering
Scheduling and Workflow Management
Future Plans for Cuelake
Biggest Gaps in Data Management Tools
Closing Remarks