Summary
The scale and complexity of the systems that we build to satisfy business requirements are increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you mean by the term "Data Orchestration"?
- How does it compare to the concept of "Data Virtualization"?
- What are some of the tools and platforms that fit under that umbrella?
- What are some of the motivations for organizations to use the cloud for their data oriented workloads?
- What are they giving up by using cloud resources in place of on-premises compute?
- For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments?
- What are some of the common patterns for cloud migration projects and what challenges do they present?
- Do you have advice on useful metrics to track for determining project completion or success criteria?
- How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals?
- Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?
- What are some of the common pain points that organizations encounter when working on hybrid implementations?
- What are some of the missing pieces in the data orchestration landscape?
- Are there any efforts that you are aware of that are aiming to fill those gaps?
- Where is the data orchestration market heading, and what are some industry trends that are driving it?
- What projects are you most interested in or excited by?
- For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Alluxio
- UC San Diego
- Couchbase
- Presto
- Spark SQL
- Data Orchestration
- Data Virtualization
- PyTorch
- Rook storage orchestration
- PySpark
- MinIO
- Kubernetes
- OpenStack
- Hadoop
- HDFS
- Parquet Files
- ORC Files
- Hive Metastore
- Iceberg Table Format
- Data Orchestration Summit
- Star Schema
- Snowflake Schema
- Data Warehouse
- Data Lake
- Teradata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance.
Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council.
Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey, and today I'm interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud. So, Dipti, can you start by introducing yourself? Yes. Hi, Tobias. This is Dipti. I'm the vice president of product and marketing here at Alluxio, and it's great to be with you here today. And do you remember how you first got involved in the area of data management?
[00:02:41] Unknown:
Yes. So a fun fact about that: if you look at my initials, they're actually DB, database. And so I joke that I was actually born with that talent. But in reality, I was working on my master's thesis in the database lab at UC San Diego many, many years ago. I won't tell you how long. And that's kind of how I got interested in it. I started off writing kernel code for DB2 distributed for Linux, and I was there for many years. And then over time, you know, the data management and database space has evolved quite a bit. Structured data, unstructured, semi-structured data. I spent many years at Couchbase, a NoSQL database, at MarkLogic, and at a GPU analytics company right before this. Alluxio was very interesting because it kind of took apart the database in some ways, and each layer has now become its own system: the query engine,
you can think of it as Presto or Spark SQL. Alluxio is kind of like, you know, the buffer pool in some ways, and then you have the storage engines. And so it still kind of kept me on that database path, and I got really interested in Alluxio given the cloud
[00:03:55] Unknown:
movement that we're seeing in the market as well. Yeah. It's definitely interesting how a lot of the products that have previously been very vertically integrated have been starting to get sort of exploded into different projects where the database is getting exploded, as you said, into the query planning layer and then the data storage layer as separate. And then also for things like ETL frameworks where it used to be an end to end solution, those have been exploded into a programming layer that then sits on top of an execution substrate, and then cloud has just made us rethink the way that we build all these different applications
[00:04:37] Unknown:
and previously integrated stacks. Yeah. Absolutely. I think, you know, there's a big architectural shift. In some ways, people are calling it separation of compute and storage, but it's not just compute and storage. Right? It's all the layers in the middle. And so, with the cloud, the rise of the object stores, data silos increasing, and data exploding in general, I think there's just such a need for new architectures that can manage this vastness of data, and we're still in the beginnings of that, frankly. So it's a very interesting space to be in. And the challenge that we are dealing with at Alluxio, obviously, is being between compute and storage, enabling the separation, and helping orchestrate data, making data more available and friendly, if you will, to the compute.
[00:05:26] Unknown:
So we've touched on this term a couple of times as far as data orchestration. I'm wondering if you can give a bit of a description about how you define the term so that we can use that as a launching point for the rest of the conversation. Yeah. Sure. Absolutely. So,
[00:05:40] Unknown:
what we call data orchestration is a method or methodology of enabling data for the purposes of compute. And this has many different aspects bundled in. So it's restoring data locality. Data locality is a big aspect of computing, and in these environments, data gets spread out and separated. Data accessibility: how is the same data accessible in many, many different ways? That's becoming more important to the computational frameworks. And then finally, data elasticity. And in some ways, it's a little bit like data virtualization, but it's a global namespace. So a common way to access a range of silos of data that might live underneath. And so that's kind of how we think of data orchestration. It's an enabler for compute, to make data more available, more accessible, and elastic
[00:06:38] Unknown:
to compute. There's another term that I've come across a few times, that's data virtualization. And I'm wondering if you can give a bit of a comparison between data orchestration and data virtualization, so that anybody who happens to come across them later on can clearly understand what they're dealing with? Absolutely.
[00:06:57] Unknown:
Yeah. So data virtualization, in some ways, isn't new, because whenever there are silos, you might need to abstract those, and there's a layer in the middle that's a logical layer, if you will. It's a logical layer that allows you to abstract the data silos underneath. And that's really what data virtualization is. It's kind of the common namespace or global namespace across those silos, be it relational databases, be it file systems, be it unstructured data. But orchestration, in some ways, is more than just data virtualization, because while you need the abstraction, and it's very important to abstract away where the physical data, the bits and bytes and blocks, actually lives, there are other aspects that become more important in this disaggregated stack, as we talked about. When, let's say, a Spark job is running in a cloud and your data is in a remote S3, in a remote region, or in HDFS on prem, then you obviously need that abstraction, because you don't want to deal with multiple storage systems.
But you also need to pull your data closer to your compute, and you also need to make it available using the right APIs. And so that's where, in some ways, data virtualization is one of the elements of orchestration. But orchestration needs to be so much more, because once the data lands in a data lake, in a data silo, then it's going to be used and accessed in many, many different ways. It could be using a SQL-based query engine. It could be using a machine learning model based on PyTorch or TensorFlow. There are so many ways to access data.
And you want to optimize this data at the point of compute, for the compute, as opposed to before it lands. And that's kind of what data orchestration is in some ways. Once the data lands, just land it somewhere in raw form, and then data orchestration is one of the tools available, along with, you know, data prep and so on, that can enable that data
[00:09:08] Unknown:
for the compute that the data scientist or the data analyst is using. And, obviously, Alluxio is one of the tools that fits into this overall category. I'm wondering if you can give a bit of an overview of the total landscape of tools that people might be familiar with. And for anybody who wants to dig deeper into what Alluxio is itself, I'll add a link to the show notes for the interview that I did previously on that. Yeah. Absolutely. Yeah. Alluxio is kind of one of those,
[00:09:35] Unknown:
you know, one of the tools that represents data orchestration. In some ways, given that data locality or caching is one of the aspects, any cache that is enabling compute in some ways belongs to this category. We also see other projects in the Kubernetes space, like Rook, that are kind of positioning themselves as storage orchestration. In some ways, it's similar: it's basically enabling a range of storage systems to talk to the compute on top. And so those are some of the different tools that are coming up. We see a few other startups in the space as well that are focusing on orchestrating and caching and data accessibility across different storage systems. So we see this as an emerging category, which will, over time, in some ways, incorporate other capabilities as well.
Right now, it's primarily locality through caching, or data virtualization in some ways through a global namespace, and then accessibility through a range of APIs, be it, you know, S3 or HDFS and
[00:10:51] Unknown:
so on. So one of the use cases that we highlighted at the outset was the idea of cloud migration, for people who have an on-premises infrastructure for their data storage and analytics and are trying to take advantage of the elasticity and services of the cloud. But I'm wondering if there are any other particular use cases that data orchestration is well suited for. Yeah. Absolutely. I think
[00:11:17] Unknown:
data orchestration is relevant in many different cases. As I've talked with users and our customers, what we hear from them is that in the all-cloud use case, if they're all in the cloud, right, then you're still in a separated environment. Your S3 is separate from your compute, and you can still pull in that data closer, or it might be in a remote region or a different cloud. Right? If you are on prem, then there are a couple of transitions that are happening today. Either users and enterprises are looking at migration to the cloud, which is an extremely daunting and challenging task, or they might be looking at offloading and moving to object storage on prem as well. And those aren't as well suited for the analytical workloads. And so data orchestration could help with any of these. But let's take one example. Let's take an example of a migration scenario where a large enterprise has an existing HDFS deployment. Typically, it's a large cluster. It might be anywhere from 100 nodes to 2,000 nodes. And we almost always see that it's overloaded.
You know, compute is heavily used and so on. And so they're looking for additional capacity, either to offload, or because new workloads need to be put somewhere else. And as the data engineers look at this problem and try to solve it, they have a couple of options. Right? They can continue to add more hardware, which becomes an expensive proposition, and I think people are trying to stop really investing on prem and are looking at other options. You could say you can't run any new jobs, which is not an option either. You could say, hey, let me schedule this at some other time at night, or, you know, you can play around with optimizations.
But if you need additional compute in the cloud, then there are a couple of approaches. Right? You could do data copy by workload. So, let's say you have a monthly report that you're running, or a particular Spark model that you want to run, and you do a mass copy, right, of this data into the cloud. Now you have two different copies. They might be out of sync. You run your job, and then you have to figure out what part of that gets copied back and so on. So it's a fairly involved, manual, kind of error-prone process. Versus, in the case of data orchestration, it's almost like a no-copy zone. You don't want to create copies. Copies create more problems than they solve in some ways. And so you want to use sync. You want to keep the data in sync with the core. There should be just one master in some ways, and the data can be synced back and forth. And that makes it easier for somebody to then burst their workload in the cloud. Data can still stay on prem.
The compute might be in the cloud, and then data orchestration kind of syncs the data back and forth. And that helps with taking one small step towards the cloud, as opposed to, you know, this massive migration, which can be, from what we hear, a several-years-long process. So let me just pause there and see: does that answer your question?
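The sync-instead-of-copy approach Dipti describes, keeping a single master copy on prem while cloud compute reads through an intermediate layer and writes flow back, can be sketched as a toy read-through, write-back cache. This is purely illustrative Python with invented names; it is not Alluxio's actual API:

```python
class OrchestrationCache:
    """Toy read-through / write-back cache: the on-prem store stays the
    single master copy; cloud compute reads through the cache, and writes
    are synced back, so no divergent bulk copies are ever created."""

    def __init__(self, master_store):
        self.master = master_store   # e.g. on-prem HDFS, modeled as a dict
        self.cache = {}              # data pulled close to the cloud compute
        self.dirty = set()           # keys written in the cloud, not yet synced

    def read(self, key):
        if key not in self.cache:            # cache miss: pull from the master
            self.cache[key] = self.master[key]
        return self.cache[key]

    def write(self, key, value):
        self.cache[key] = value              # the cloud job writes locally...
        self.dirty.add(key)

    def sync(self):
        for key in self.dirty:               # ...and changes flow back to the
            self.master[key] = self.cache[key]  # single master copy
        self.dirty.clear()


# A cloud workload reads on-prem data through the cache and syncs results back.
on_prem = {"/sales/2019.parquet": b"raw-bytes"}
cache = OrchestrationCache(on_prem)
data = cache.read("/sales/2019.parquet")
cache.write("/reports/monthly.parquet", b"aggregated")
cache.sync()
```

The point of the sketch is the invariant: there is one master, and the cloud side only ever holds a synchronized working set, never an independent copy that can drift.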
[00:14:43] Unknown:
Yeah. Definitely. And I'm wondering if you have any specific instances, or an example topology, of either a customer that you've worked with or a story that you've heard from people who are using Alluxio or some of these other tools in this space, to give a bit more of a concrete feel for somebody who is maybe still a little uncertain about how they might go about approaching this particular type of problem: having a constraint in terms of the amount of compute or storage that they're able to access, and then needing to be able to leverage some of these technologies to be able to expand their footprint and expand their capabilities?
[00:15:24] Unknown:
Yeah. Absolutely. Great question. So, actually, recently, a few weeks ago, I presented with Vitali from DBS Bank. We presented about this use case about hybrid cloud. They have a large storage system on prem, where they're also running Alluxio with Spark and Presto on top. But, more related to our discussion here, they were facing these limited situations where there were new projects that were coming up. They needed to analyze data for new projects, and they just couldn't figure out how to do it. The specific project that they started to look at was a call center project, where they have, you know, hundreds of thousands of calls that come in, and even more annually.
And there are transcripts, there are audio files. Right? And all of this needs to be analyzed: what are people calling in for? What are the problems? What is the root cause? Can the customer experience be improved? And they wanted to use machine learning to analyze this and figure out what these root causes were, in order to improve the experience. And so they looked at a stack on AWS with Amazon SageMaker, which their data scientists used to actually write the models. The models then ran in PySpark on an AWS EMR cluster. And then Alluxio was feeding this EMR cluster the data, which was actually coming from on prem, and this was only in memory. Because as a bank, being the largest bank in Southeast Asia, they have a lot of restrictions about data in the cloud. PII information obviously cannot be moved over. But even if data is moved, it can only live in memory; it cannot be persisted. And so with all these restrictions, they essentially run Alluxio in memory, and they use it to sync data back and forth, the data that's needed to run this machine learning use case in the cloud. And, as needed, they can burst, whether it's a 100-node or a 1,000-node cluster. It's very easy to get going, because it's just a couple of clicks, and it's very easy to get set up in the cloud. So no longer do you need to wait, you know, months for hardware to come in and so on.
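The compliance constraint in this story, remote data may be held in cloud memory but never persisted, behaves like a capacity-bounded, memory-only cache in front of the on-prem store. Here is a minimal sketch under that assumption; the names are invented and this is not DBS's or Alluxio's actual implementation:

```python
from collections import OrderedDict


class InMemoryOnlyCache:
    """Toy LRU cache: data pulled from the on-prem store lives only in
    memory and is evicted under capacity pressure. Nothing is ever written
    to cloud storage, which is the compliance property described above."""

    def __init__(self, remote_store, capacity):
        self.remote = remote_store    # on-prem data, modeled as a dict
        self.capacity = capacity     # how much fits in cloud memory
        self.mem = OrderedDict()     # key -> value, in LRU order

    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)         # mark as recently used
        else:
            self.mem[key] = self.remote[key]  # fetch from on prem on a miss
            if len(self.mem) > self.capacity:
                self.mem.popitem(last=False)  # evict the least recently used
        return self.mem[key]


# Cloud compute pulls call-center files on demand; only two fit in memory.
on_prem = {f"/calls/{i}.wav": f"audio-{i}" for i in range(5)}
cache = InMemoryOnlyCache(on_prem, capacity=2)
cache.get("/calls/0.wav")
cache.get("/calls/1.wav")
cache.get("/calls/2.wav")   # evicts /calls/0.wav from cloud memory
```

Evicted data is simply re-fetched from on prem when needed again, so the cloud side never has to persist anything to satisfy the workload.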
Capacity is elastic and on demand. And the compute scales elastically, as well as the data tier, which becomes important. So that's kind of a real-world example, to give a little bit more context on how somebody could actually use data orchestration to do this bursting into the cloud. So the obvious
[00:18:07] Unknown:
motivation for using cloud resources is that elasticity and the fact that you have a much shorter time cycle for being able to access that additional capacity. But what are some of the capabilities that you're giving up by using cloud resources in place of your on premises compute and storage infrastructure?
[00:18:28] Unknown:
Good question. So I think that you do have to be thoughtful. Right? Because the security paradigms are a little bit different for the cloud, and so you might need to be a little bit more familiar with those. So some of it might be skill, right? Acquiring new skills for the team. Understanding the right protocols, the right authentication mechanisms, etcetera, whether it's using, you know, AWS Active Directory or something else. Right? So I think that's one part of it. The second part of it is that the services and the technologies are quite different; they're different services, right, at the end of the day. And if you have an on-prem model and a data center mindset, the mindset itself is a little bit different. You have to move from planning two or three years ahead, right, to pay as you go. Right? So there is quite a big difference in terms of the budgeting and the modeling that happens.
It's so easy to run a large cluster and, you know, collect a big bill on AWS that you could get into trouble. There are obviously ways around that. With flexibility, you know, comes great responsibility. And so you have to kind of plan for these things and think about these things ahead of time. And there are a number of technologies
[00:19:59] Unknown:
as well that are coming out that make it possible to replicate some of the, at least, interfaces of cloud environments on physical hardware and on-premises infrastructure. And so I'm curious what you've seen as far as organizations trying to replicate some of that interface, to make it easier for when they do span to the cloud, so that they don't have to modify or customize their application stack to be able to take advantage of, for instance, an S3 interface, where they might be using MinIO on their on-premises infrastructure, or taking advantage of Kubernetes or maybe OpenStack for being able to have some of the flexibility of the cloud on their actual physical compute.
[00:20:45] Unknown:
Yeah. Absolutely. I mean, Kubernetes is extremely widespread, definitely for stateless workloads. For stateful workloads, especially in a disaggregated stack, it's a little bit complicated, and, actually, data orchestration helps with that. Because let's say you have data that's on prem, in a different environment, in different systems. Within Kubernetes, you still don't have a highly elastic data layer, and so the data itself lives outside the Kubernetes cluster. Right? And so, actually, we see Spark on Kubernetes and Presto on Kubernetes quite often, where they're trying to solve the data locality challenge within Kubernetes, and this is on prem.
And so we do see that coming up. I think over time, these data-driven, or data-intensive, workloads will get more productionized for Kubernetes. The stateless workloads like app servers were tried first; then came these operational databases that used persistent volumes and built robust, production-ready environments using Kubernetes. The next set of technologies is these data-driven frameworks like Spark, Presto, etcetera. And we're just starting to see that people are trying them out, POC-ing them, in early stages. Some are in production, but it's still early. And I think that Kubernetes obviously helps with that elasticity, getting that on prem, and it simplifies that problem. But at the end of the day, you still need the hardware to run it on. Right? And so you will get a lot more flexibility.
But the flexibility of the cloud, in terms of having unlimited compute if you need it, is still something only the cloud gives you. In terms of object storage, we definitely see a rise in object storage. S3 made object stores popular, whether they're in the cloud or on prem, particularly as Hadoop gets disaggregated even more. HDFS workloads will slowly start moving towards object stores like MinIO; there are several others out there as well. And the same thing will apply: you would need to, you know, kind of restore data locality on top of these object stores, which aren't really built for advanced analytics.
And we see that as an upcoming, or sort of an emerging, use case that will come about maybe
[00:23:31] Unknown:
six months to a year down the line, and become more prominent. So you mentioned the specific instance of the work that DBS Bank is doing, but I'm wondering if there are any other common patterns that you've seen for cloud migration projects in the data and analytics space, and some of the challenges that they present. And then also, I'm wondering if you have any advice on useful metrics for being able to track and determine the overall completion and success criteria for the project? Yeah. In terms of migration,
[00:24:04] Unknown:
we've seen a couple of different approaches. If the mindset is that the cloud initiatives in the enterprise are the paramount initiative, in the sense that there's an immediate urgency to move to the cloud, really the only approach that works is migrating the entire dataset to the cloud and then running workloads in the cloud. Right? But it is a nontrivial approach. Not everyone has the ability to run both on prem and in the cloud at the same time; it obviously doubles costs. But if that is an option, then there are companies that just go about it that way. Most of the companies, most of the enterprises, the way they think through it is workload by workload. Right? And they start off with the low-risk workloads, and these are workloads that might be ephemeral. So, for example, you just run the workload and, you know, it's done and it's gone. They start off with the ephemeral workloads, then they move to the more scheduled workloads that might be adding additional capacity, additional overhead, at a certain time of the month. For example, at the end of the month or the end of the quarter, there might be a lot more reporting jobs that run. And so those can be methodically moved over to the cloud; their datasets can be moved over to the cloud. Now, if there's a huge overlap in the datasets across all these workloads, then it becomes harder. And that's where the zero-copy burst, or the bursting to the cloud, would make more sense. And so it depends, you know; these are the different approaches.
It depends on the mindset of the company, the urgency of moving to the cloud, the goal of it, as well as the appetite, right, in some ways. There have been some breaches and so on, which has caused people to think about, okay, is this the right thing to do? Especially with their enterprise data. And so what we hear is that some data will continue to live on prem, and it will never be moved to the cloud. A lot of data will move to the cloud, but there will be some kind of IP, almost an IP category of data, which is so precious that it will continue to remain on prem. Now, the other part of the question you asked is metrics. Right? Metrics in terms of how do you gauge success or progress through this process? At the highest level, there is, you know, the size of the total dataset.
But then you come down to the next level of granularity, which is the number of tables you're managing. And then you can look at that on a business unit basis. Because, invariably, this data is split by, you know, finance and marketing and product, and all this data is in a single data lake or in a few different data lakes. Right? And so you can go organization by organization. And within the organization, it is the number of tables that have been migrated or not, and then which reports are associated with those tables. Right? So that you are never in a situation where, if a report comes in, if a query comes in, the dataset is not available. There might be a period of time where you have to have both available, and then you're playing catch-up. At one point, you say no more new data gets generated there, and then you start adding the new data in the cloud, and then that workload permanently gets run in the cloud. So those are some of the things that we hear as we talk to users and try to identify workloads that they would like to burst, or that would be easiest to begin with. Ephemeral Spark jobs, particularly for machine learning, modeling, regression analysis, Monte Carlo analysis, those kinds are relatively easy.
And then you move on from that point.
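The per-organization, per-table tracking described above can be sketched as a small progress report. This is an illustrative stdlib-only sketch; the catalog entries, field names, and `migration_progress` function are all hypothetical, not part of any real migration tool:

```python
from collections import defaultdict

# Hypothetical catalog entries: (business_unit, table_name, migrated?)
tables = [
    ("finance", "ledger", True),
    ("finance", "invoices", False),
    ("marketing", "campaigns", True),
    ("marketing", "leads", True),
    ("product", "events", False),
]

def migration_progress(tables):
    """Return the percentage of tables migrated, per business unit."""
    done = defaultdict(int)
    total = defaultdict(int)
    for unit, _name, migrated in tables:
        total[unit] += 1
        if migrated:
            done[unit] += 1
    return {unit: 100.0 * done[unit] / total[unit] for unit in total}

print(migration_progress(tables))
# {'finance': 50.0, 'marketing': 100.0, 'product': 0.0}
```

In practice the table list would come from a metastore or catalog, and the next drill-down level would map each table to the reports that depend on it.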
[00:28:18] Unknown:
One of the other things that plays into the complexity of these migration efforts is that the cloud inherently requires a different approach to how you design and implement infrastructure and applications, because many of the resources are ephemeral and there are no durability guarantees outside the storage layer, such as object storage or block storage. So I'm wondering what you have seen as far as pain points for companies going through these projects, whether hybrid implementations or cloud migrations, and some of the ways they approach employee education so teams can design and implement effective systems for these migrations and the goals that they have.
[00:29:06] Unknown:
Yeah. I think that in most scenarios, they do it use case by use case. In the case of DBS Bank, as we talked about, it was the call center project, because there was a need. They knew it would probably require a significant amount of compute resources to run those workloads, and that might be nontrivial: suddenly adding an unpredictable workload affects the on-premise systems. So workloads that are unpredictable, where you don't necessarily know how much I/O or CPU overhead they will put on your existing system, become the first category they start trying out. Then the program begins from there. In many cases, new applications are simply not allowed to be added to the existing on-prem Hadoop cluster. They say: this is a specific dataset, we don't have additional compute capacity, so all new workloads that satisfy this set of requirements will be run in the cloud.
So that's the first step: new workloads and new jobs move to the cloud. Moving the rest is the harder part. For that, you've got to have a very thoughtful migration program, as we talked about earlier, to create a process from organization to organization, and run POCs to see the implications. Because as a platform organization you're providing a service to other organizations, you have to ensure that when your environment changes, it's going to be at least as good. So performance becomes a criterion, feature set becomes a criterion,
and the level of security becomes a criterion. Based on those, you create the migration program to figure out what can be burst immediately and what needs to be physically moved over using more of a lift-and-shift approach.
[00:31:28] Unknown:
And we've highlighted a few different tools in the data orchestration space in the form of Alluxio, Presto, Spark, etcetera. But I'm wondering what you see as the missing pieces or gaps in the landscape, where there's an opportunity to either extend some of the existing tools or build a new tool or platform to fill that particular gap, and any efforts you're seeing on those fronts?
[00:31:53] Unknown:
Yeah, good question. As a product person, I'm constantly looking for exactly this: where are the gaps and how can we fill them, obviously from an Alluxio perspective, but also more broadly. What we're seeing is that the world of advanced analytics, in the Hadoop world, moved to files. Everything is a Parquet file or an ORC file; it's a file-based approach as opposed to a table-based approach. What this means is that you're dependent on the Hive Metastore, or catalogs like it, to really do anything. Presto, for example, needs it; Spark SQL needs it. So there's an interesting space where projects are emerging, like Iceberg, where you can actually get the schema out of your Parquet files and do things more efficiently and optimize.
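The catalog problem described above is, at its core, exposing one schema over many raw files. Iceberg and the Hive Metastore read schema from Parquet/ORC metadata; as a rough stdlib-only illustration of the idea, here is a hypothetical sketch that unions the fields seen across a pile of JSON-like records (real systems use the columnar file footers, not record scanning):

```python
# Hypothetical raw records, standing in for rows landed as files in a data lake.
records = [
    {"user_id": 1, "amount": 9.99, "country": "US"},
    {"user_id": 2, "amount": 4.50, "country": "DE", "coupon": "SPRING"},
]

def infer_schema(rows):
    """Union the field names and value types seen across all rows,
    the way a catalog service exposes a single schema for many files."""
    schema = {}
    for row in rows:
        for field, value in row.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Sort type names so the result is deterministic.
    return {field: sorted(types) for field, types in schema.items()}

print(infer_schema(records))
# {'user_id': ['int'], 'amount': ['float'], 'country': ['str'], 'coupon': ['str']}
```

Optional fields like `coupon` show why query engines such as Presto and Spark SQL lean on a shared metastore: without one, every engine would have to re-derive the schema itself.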
I think that space is ripe and will evolve, and we're thinking about it as well. In fact, we might have some fun announcements at our Data Orchestration Summit coming up on November 7th. So that might be one of the areas. The other area I see that's interesting is data transformation. So one is catalog services: how do you manage structured metadata in a land of files and objects? The second is transformation. Up until now, when you moved data from an operational OLTP system, you would go through an ETL process, prep it, convert it to a star schema, do the optimizations beforehand, and then land it in your data warehouse.
And then you would just query it. That was the holy grail: the star schema, the snowflake schema, and so on. Now what's happening is that there's so much data that it just lands, in the data warehouse or the data lake. So instead of spending time optimizing before it lands, you might do some event-based processing beforehand, with Kafka for example. But optimizing how the data is stored is no longer as important up front, because you're not depending on a star schema as such anymore. The optimization, the transformation, needs to happen after the data lands. What we see is that you will actually optimize on the way to compute: when compute needs the data, that's where it gets optimized on demand and converted. If there's a better file format, you convert it to that format.
If it needs to be compressed, you compress it. If the files need to be coalesced, you coalesce them. So the prep and the transformation happen on demand, as you're consuming the data. I think there are interesting things to be done in that space that we're looking at as well. But beyond Alluxio, I think there will be a lot more optimization for the compute, from a performance perspective, because every compute engine is different and favors different formats, compression techniques, and approaches. So that's another space where we will see a lot more automation.
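The coalesce-and-compress step described above can be sketched with the standard library. This is a hypothetical toy, using in-memory byte chunks and gzip where a real pipeline would merge small files on object storage into larger columnar (e.g. Parquet) files:

```python
import gzip

# Hypothetical small files, as they might land one-by-one in a data lake.
small_files = [b"a,1\n", b"b,2\n", b"c,3\n"]

def coalesce_and_compress(chunks):
    """Merge many small byte chunks into one gzip-compressed blob,
    optimizing on demand rather than before the data lands."""
    merged = b"".join(chunks)
    return gzip.compress(merged)

blob = coalesce_and_compress(small_files)
# Round-trip to confirm nothing was lost in the merge.
print(gzip.decompress(blob))
```

The point of the sketch is the ordering: the merge and compression happen on the read path, when a compute engine asks for the data, rather than in an up-front ETL stage.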
So less manual work, a lot more automation, and more tools will come up.
[00:35:29] Unknown:
And what are some of the overall industry and market trends that are driving the data orchestration movement, some of the ways that data orchestration fits into those broader ideas, and the particular projects or trends that you are personally most excited by?
[00:35:52] Unknown:
Yeah. As we started off the conversation, we talked about how the stack has become more disaggregated. That trend of separated storage and compute, a disaggregated stack, will continue. There might be some tightly, vertically integrated systems, but the large majority of platform teams will use disaggregated architectures, and that's certainly driving data orchestration. Another trend along those lines is self-service data: making data analysts and data scientists a lot more efficient by giving them access to the data whenever and wherever they need it. The two reinforce each other: data orchestration enables self-service data, and that trend helps make data orchestration more real.
In terms of the other question, what I'm most excited about: I think the data revolution is just at its beginnings. We've gone through many iterations of the Hadoop ecosystem, from MapReduce to Hive to Spark SQL to Presto, and there will be constant innovation on the framework side. Moving to the cloud, the resources available there, and the flexibility of the cloud will enable a new set of innovations for getting insights from data significantly faster.
Because of these operational efficiencies, time to value and time to insight will be significantly shorter. It could be using GPUs more efficiently at scale, or query engines that are highly distributed at scale. This entire data ecosystem will continue to evolve as it has. If you look back, mainframes are still around and Teradata is still around, but at the same time, the amount of data and the capabilities it unlocks are just exploding. So I continue to be very excited about this space and, in some ways, continue living up to my initials.
[00:38:29] Unknown:
So for somebody who wants to learn more about the overall space of data orchestration, the benefits it can provide, and some of the industry trends that it's driving and that are driving it, what are some of the resources that you would recommend?
[00:38:50] Unknown:
Yeah, absolutely. Alluxio is an open source project, so we have a lot of information openly available on our site, and the community edition is free to download. We're also putting together a Data Orchestration Summit planned for November 7th, with a great lineup of speakers: top leaders from Netflix, O'Reilly, DBS Bank, and Walmart are coming to present. And we have a special offer as well: you can use the code "podcast" for 25% off registration. The Data Orchestration Summit will help you understand not just a bit more about Alluxio, but also the other data engineering tools out there that can improve efficiency and help the unsung heroes that are the data engineers.
[00:39:49] Unknown:
And are there any other aspects of data orchestration, hybrid cloud, or cloud migration projects that we didn't discuss yet that you'd like to cover before we close out the show?
[00:40:02] Unknown:
I think we covered a lot. It was a great conversation. We have a lot of information on our website, and I'm personally looking forward to seeing folks at the summit. So thanks so much for having me on, Tobias.
[00:40:20] Unknown:
And for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. So as a final question, I'd just like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today. Alright.
[00:40:37] Unknown:
The hardest question at the end. I think there are many gaps; we talked about some of them. Metadata management in general, across structured and unstructured data, is something that needs to be tackled a lot more. At the end of the day, the value and the insight of the data is in the metadata. A lot more lineage, tracking, and real leveraging of metadata can be done across the enterprise. It's a very hard problem to solve, and it's being solved in silos today. It will be interesting to see how that space emerges and that gap closes, because at the end of the day, if you access data, if you manage data, you have to have access to the metadata first.
And I think that's something that will need to be solved in the short term. There are obviously other gaps out there that will emerge, and innovation will continue.
[00:41:41] Unknown:
Yeah. Well, thank you very much for taking the time today to join me and share your thoughts on the space of data orchestration and some of the ways that it can have concrete benefits. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Absolutely. Thanks so much. Bye bye.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsors
Interview with Dipti Borkar Begins
Understanding Data Orchestration
Use Cases for Data Orchestration
Challenges and Considerations in Cloud Migration
Employee Education and Migration Strategies
Industry Trends and Future of Data Orchestration
Resources and Closing Remarks