Data Orchestration For Hybrid Cloud Analytics - Episode 103

Summary

The scale and complexity of the systems that we build to satisfy business requirements are increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct the underlying infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.


Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what you mean by the term "Data Orchestration"?
    • How does it compare to the concept of "Data Virtualization"?
    • What are some of the tools and platforms that fit under that umbrella?
  • What are some of the motivations for organizations to use the cloud for their data oriented workloads?
    • What are they giving up by using cloud resources in place of on-premises compute?
  • For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments?
  • What are some of the common patterns for cloud migration projects and what challenges do they present?
    • Do you have advice on useful metrics to track for determining project completion or success criteria?
  • How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals?
  • Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?
    • What are some of the common pain points that organizations encounter when working on hybrid implementations?
  • What are some of the missing pieces in the data orchestration landscape?
    • Are there any efforts that you are aware of that are aiming to fill those gaps?
  • Where is the data orchestration market heading, and what are some industry trends that are driving it?
    • What projects are you most interested in or excited by?
  • For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey, and today I'm interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud. So Dipti, can you start by introducing yourself?
Dipti Borkar
0:02:29
Yes. Hi, Tobias. This is Dipti, the Vice President of Product and Marketing here at Alluxio, and it's great to be with you today.
Tobias Macey
0:02:37
And do you remember how you first got involved in the area of data management?
Dipti Borkar
0:02:40
Yes. So a fun fact about that: if you look at my initials, they're actually DB, database. And so I was actually born with that talent. But in reality, I was working on my master's thesis in the database lab at UC San Diego many, many years ago, I won't tell you how long, and that's kind of how I got interested in it. I started off writing kernel code for DB2 Distributed for Linux, and I was there for many years. And then over time, you know, the data management and database space has evolved quite a bit: structured data, unstructured, semi-structured data. I spent many years at Couchbase, a NoSQL database, then MarkLogic, and at a GPU analytics company right before this. And Alluxio is very interesting because it kind of took the database apart in some ways, and each layer has now become its own layer. The query engine, you can think of it as Presto or Spark SQL; Alluxio is kind of like the buffer pool in some ways; and you have the storage engines. So it still kind of kept me in that database path. And I got really interested in Alluxio given the cloud movement that we're seeing in the market as well.
Tobias Macey
0:03:58
Yeah, it's definitely interesting how a lot of the products that have previously been very vertically integrated have been starting to get sort of exploded into different projects, where the database is getting exploded, as you said, into the query planning layer, and then the data storage layer is separate. And then also for things like ETL frameworks, where it used to be an end-to-end solution, those have been exploded into a programming layer that then sits on top of an execution substrate, and then the actual data that it's interfacing with is housed somewhere else. So it's definitely interesting how the cloud has just made us rethink the way that we build all these different applications and previously integrated stacks.
Dipti Borkar
0:04:38
Yeah, absolutely. I think, you know, there's a big architectural shift. In some ways people are calling it separation of compute and storage, but it's not just compute and storage, right? It's all the layers in the middle. And so, with the cloud, the rise of the object stores, data silos increasing, data exploding in general, I think there's just such a need for new architectures that can manage this vastness of data that we're still at the beginnings of, frankly. So it's a very interesting space to be in. And the challenge that we are dealing with at Alluxio, obviously, is being between compute and storage, enabling this separation and helping orchestrate data, making data more available and friendly, if you will, to the compute.
Tobias Macey
0:05:25
So we've touched on this term a couple of times as far as data orchestration. And I'm wondering if you can give a bit of a description about how you define the term so that we can use that as a launching point for the rest of the conversation.
Dipti Borkar
0:05:37
Yeah, sure, absolutely. So what we say data orchestration is, is a methodology of enabling data for the purposes of compute, and this has many different aspects bundled in. So it's restoring data locality: data locality is a big aspect of computing, and in these environments data gets spread out, separated out. Data accessibility: how is the same data accessible in many, many different ways? That's going to be more important to the computational frameworks. And then finally, data elasticity. In some ways it's a little bit like data virtualization, but it's a global namespace, or a common way to access a range of silos of data that might live underneath. And so that's kind of how we think of data orchestration: it's an enabler for compute, to make data more available, more accessible, and elastic to compute.
Tobias Macey
0:06:38
There's another term that I've come across a few times, data virtualization, and I'm wondering if you can give a bit of a comparison between data orchestration and data virtualization so that anybody who happens to come across them later on can clearly understand what they're dealing with.
Dipti Borkar
0:06:55
Absolutely. Yeah. So data virtualization in some ways isn't new, because, you know, whenever there are silos you might need to abstract those. And there's a layer in the middle, a logical layer, if you will, that allows you to abstract the data silos underneath. And that's really what data virtualization is. It's kind of the common namespace, a global namespace across those silos, be it relational databases, be it file systems, be it unstructured data. But orchestration in some ways is more than just data virtualization. Because while you need the abstraction, and it's very important to abstract away where the physical data, the bits and bytes and blocks, actually live, there are other aspects that become more important in this, as we talked about, disaggregated stack. When, let's say, a Spark job is running in a cloud and your data is in a remote S3 in a remote region, or in HDFS on prem, then you obviously need that abstraction, because you don't want to deal with multiple storage systems. But you also need to pull your data closer to your compute. And you also need to make it available using the right APIs. And so that's where, in some ways, data virtualization is one of the elements of orchestration. But orchestration needs so much more. Because once the data lands in a data lake, in a data silo, then it's going to be used and accessed in many, many different ways. It could be using a SQL-based query engine, it could be using a machine learning model that's based on PyTorch or TensorFlow; there are so many ways to access data. And you want to optimize this data at the point of compute, for the compute, as opposed to before it lands. And that's kind of what data orchestration is, in some ways. Once the data lands, it has just landed somewhere in raw form.
And then data orchestration is one of the tools available, along with, you know, the data prep and so on, that can enable that data for the compute that the data scientist or the data analyst is using.
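As an illustration of the global namespace idea described above, here is a minimal Python sketch. It is not from the conversation and is not Alluxio's API; the class name, mount paths, host names, and bucket names are all hypothetical. It only shows how one logical path space could resolve to different physical silos.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedNamespace:
    """Maps logical path prefixes to backing stores (all names hypothetical)."""
    mounts: dict = field(default_factory=dict)

    def mount(self, prefix: str, backend_uri: str) -> None:
        # Normalize so "/sales" and "/sales/" refer to the same mount point.
        self.mounts[prefix.rstrip("/")] = backend_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Pick the longest mount prefix that matches on a path boundary.
        best = None
        for prefix in self.mounts:
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        return self.mounts[best] + logical_path[len(best):]


ns = UnifiedNamespace()
# Assumed layout: most data stays in on-prem HDFS, one subtree lives in S3.
ns.mount("/warehouse", "hdfs://onprem-namenode:8020/warehouse")
ns.mount("/warehouse/calls", "s3://example-bucket/calls")

print(ns.resolve("/warehouse/calls/2019/part-0.parquet"))
print(ns.resolve("/warehouse/finance/ledger.parquet"))
```

The compute framework would only ever see the logical `/warehouse/...` paths; where the bytes physically live is a property of the mount table, which is the part that virtualization covers. Locality (caching) and API translation are separate concerns layered on top.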
Tobias Macey
0:09:12
And obviously Alluxio is one of the tools that fits into this overall category. I'm wondering if you can give a bit of an overview of the total landscape of tools that people might be familiar with. And for anybody who wants to dig deeper into what Alluxio itself is, I'll add a link in the show notes to the interview that I did previously on that.
Dipti Borkar
0:09:31
Yeah, absolutely. Alluxio is kind of one of the tools that represents data orchestration. In some ways, given that data locality or caching is one of the aspects, any cache that is enabling compute in some ways belongs to this category. We also see other projects in the Kubernetes space, like Rook, that are kind of positioning as storage orchestration, because in some ways it's similar. It's basically enabling a range of storage systems to talk to the compute on top. And so those are some of the different tools that are coming up. We see a few other startups in the space as well that are focusing on orchestrating and caching and data accessibility across different storage systems. So we see this as an emerging category which will, over time, in some ways incorporate other capabilities as well. Right now, it's primarily locality through caching, data virtualization in some ways through a global namespace, and then accessibility through a range of APIs, be it S3 or HDFS and so on.
Tobias Macey
0:10:51
So one of the use cases that we highlighted at the outset was the idea of cloud migration for people who have an on-premises infrastructure for their data storage and analytics and are trying to take advantage of the elasticity and services of the cloud. And I'm wondering if there are any other particular use cases that data orchestration is well suited for?
Dipti Borkar
0:11:15
Yeah, absolutely. I think orchestration is relevant in many different cases. As I've talked with users and our customers, what we hear from them is that in the all-cloud use case, if they're all in the cloud, right, then you're still in a separated environment: your S3 is separate from your compute, and you can still pull that data in closer, or it might be in a remote region, or a different cloud, right? If you are on prem, then there are a couple of transitions that are happening today. So either users and enterprises are looking at migration to the cloud, which is an extremely daunting and challenging task, or they might be looking at offloading and moving to object storage on prem as well. And those aren't as well suited for the analytical workloads. And so data orchestration could help with any of these, but let's take one example. So let's take an example of a migration scenario where a large enterprise has an existing HDFS deployment. Typically it's a large cluster, it might be anywhere from 100 nodes to 2,000 nodes, and we often see that it's overloaded. You know, compute is heavily used and so on. And so they're looking for additional capacity: new workloads need to be offloaded or put somewhere else. And as the data engineers look at this problem and try to solve it, they have a couple of options, right? They can continue to add more hardware, which becomes an expensive proposition, and I think people are trying to stop really investing on prem and are looking at other options. Or you can say, you know, I can't run any new jobs, which is not an option either. Or you can say, hey, let me schedule this some other time, at night, or you can play around with optimizations. But if you need additional compute in the cloud, then there are a couple of approaches, right?
You could do data copy by workload. So let's say you have a monthly report that you're running, or a particular Spark model that you want to run. You do a mass copy of this data into the cloud. Now you have two different copies, and they might be out of sync. You run your job and then you have to figure out what part of that gets copied back and so on. So it's a fairly involved, manual, kind of error-prone process. Versus in the case of data orchestration, it's almost like a no-copy zone. You don't want to create copies. It's like, you know, copies create more problems than they solve in some ways. And you want to keep the data in sync with the core. So there should be just one master in some ways, and the data can be synced back and forth. And that makes it easier for somebody to then burst their workload into the cloud. Data can still stay on prem, the compute might be in the cloud, and then data orchestration kind of syncs the data back and forth. And that helps with taking one small step towards the cloud, as opposed to this massive migration, which can be, from what we hear, a several-years-long process. So let me just pause there and see if that answers your question.
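To make the contrast between copy-by-workload and the "one master, synced back and forth" approach concrete, here is a small Python sketch. It is an assumption-laden toy, not how Alluxio or any particular product is implemented: `OnPremStore` and `CloudCache` are hypothetical classes, and the version counter stands in for whatever metadata check a real system would use to detect stale data.

```python
class OnPremStore:
    """Stand-in for the on-prem system of record (hypothetical)."""

    def __init__(self):
        self.blocks = {}
        self.versions = {}

    def write(self, key, value):
        self.blocks[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1

    def read(self, key):
        return self.blocks[key], self.versions[key]


class CloudCache:
    """Read-through, write-through cache that sits next to cloud compute."""

    def __init__(self, source):
        self.source = source
        self.cached = {}       # key -> (value, version), held in memory only
        self.remote_reads = 0  # counts full data fetches from on prem

    def read(self, key):
        current_version = self.source.versions[key]  # cheap metadata check
        hit = self.cached.get(key)
        if hit is None or hit[1] != current_version:
            # Cache miss or stale entry: pull the data from the single master.
            self.remote_reads += 1
            hit = self.source.read(key)
            self.cached[key] = hit
        return hit[0]

    def write(self, key, value):
        # Writes go to the master first, so there is never a divergent copy.
        self.source.write(key, value)
        self.cached[key] = self.source.read(key)


store = OnPremStore()
store.write("calls/2019", "transcripts-v1")

cache = CloudCache(store)
first = cache.read("calls/2019")    # fetched from on prem
second = cache.read("calls/2019")   # served from the cloud-side cache
store.write("calls/2019", "transcripts-v2")
third = cache.read("calls/2019")    # stale entry detected, re-fetched
```

With copy-by-workload, the second copy would keep returning `transcripts-v1` until someone re-ran the copy job. Here the on-prem store stays the only source of truth and the cloud side holds a disposable working set.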
Tobias Macey
0:14:43
Yeah, definitely. And I'm wondering if you have any specific instances or an example topology of a customer that you've worked with, or a story that you've heard from people who are either using Alluxio or some of these other tools in the space, to give a bit more of a concrete feel for somebody who is maybe still a little uncertain about how they might go about approaching this particular type of problem of having a constraint in terms of the amount of compute or storage that they're able to access, and then needing to be able to leverage some of these technologies to expand their footprint and expand their capabilities.
Dipti Borkar
0:15:23
Yeah, absolutely. Great question. So actually, a few weeks ago, I presented with DBS Bank, with Vitaly from DBS Bank. We presented about this use case of hybrid cloud. They have a large storage system on prem where they're also running Alluxio with Spark and Presto on top. But more related to our discussion here, they were facing these limited-capacity situations where there were new projects coming up, they needed to analyze data for new projects, and they just couldn't figure out how to do those specific projects. The one they started to look at was a call center project, where they have, you know, hundreds of thousands of calls that come in, I think, over the years, maybe even more so annually. And there are transcripts, there are audio files, right? And all of this needs to be analyzed in terms of: what are people calling in for, what are the problems, what is the root cause, can the customer experience be improved? And they wanted to use machine learning to analyze this and figure out what these root causes were, to improve the experience. And so they looked at a stack on AWS with Amazon SageMaker, which their data scientists used to actually write the models. The models ran in PySpark on an AWS EMR cluster. And then, on AWS, Alluxio was feeding this EMR cluster the data, which actually was coming from on prem. And this was only in memory, because as a bank, being the largest bank in Southeast Asia, they have a lot of restrictions about data and the cloud. PII obviously cannot be moved over. But even if data is moved, it can only live in memory and cannot be persisted. And so with all these restrictions, they essentially run Alluxio in memory, and they use it to sync data back and forth, the data that's needed to run this machine learning use case in the cloud.
And as needed, they can burst their data; whether it's a 100-node or a 1,000-node cluster, it's very easy to get going, because it's just a couple of clicks. And it's very easy to get set up in the cloud. So no longer do you need to wait months for hardware to come in, and so on. Capacity is elastic and on demand, and the compute scales elastically, as well as the data tier, which becomes important. So that's kind of a real-world example, to give a little bit more context on how somebody could actually use data orchestration to do this bursting into the cloud.
Tobias Macey
0:18:05
So the obvious motivation for using cloud resources is that elasticity and the fact that you have a much shorter time cycle for being able to access that additional capacity. But what are some of the capabilities that you're giving up by using cloud resources in place of your on-premises compute and storage infrastructure?
Dipti Borkar
0:18:28
Hmm, good question. So I think that you do have to be thoughtful, right? Because the security paradigms are a little bit different for the cloud, and so you might need to be a little bit more familiar with those. So some of it might be skills, right? Acquiring new skills for the team, understanding the right protocols, the right authentication mechanisms, etc., whether it's using, you know, AWS Active Directory on the cloud or something else, right? So I think that's one part of it. The second part of it is that the services and the technologies are quite different; they're different services at the end of the day. And if you have an on-prem model, the data center mindset, the mindset itself is a little bit different. You have to move from planning two to three years ahead to just pay-as-you-go, right? So there's quite a big difference in terms of the budgeting and the modeling that happens. It's so easy to run a large cluster and, you know, collect a big bill. You could get into trouble. So there are obviously ways around that. With flexibility, you know, comes great responsibility. And so you have to kind of plan for these things and think about these things ahead of time.
Tobias Macey
0:19:56
And there are a number of technologies as well that are coming out that make it possible to replicate at least some of the interfaces of cloud environments on physical hardware and on-premises infrastructure. And so I'm curious what you've seen as far as organizations trying to replicate some of that interface to make it easier for when they do span to the cloud, so that they don't have to modify or customize their application stack to be able to take advantage of, for instance, an S3 interface, where they might be using MinIO in their on-premises infrastructure, or taking advantage of Kubernetes or maybe OpenStack for being able to have some of the flexibility of the cloud on their actual physical compute.
Dipti Borkar
0:20:44
Yeah, absolutely. I mean, Kubernetes is extremely widespread, definitely for stateless workloads. For stateful workloads, especially in a disaggregated stack, it's a little bit complicated, and actually data orchestration helps with that. Because let's say you have data that's on prem, in a different environment, in different systems. Within Kubernetes, you still don't have a highly elastic data layer, and the data itself lives outside the Kubernetes cluster, right? And so we actually see Spark in Kubernetes and Presto in Kubernetes quite often, where they are trying to solve the data locality challenge within Kubernetes. And this is on prem. So we do see that coming up. I think, over time, these kinds of data-intensive workloads will get more productionized for Kubernetes. I think the stateless workloads like app servers were first; then came these operational databases that use persistent volumes to build robust, production-ready environments using Kubernetes. The next set of technologies is these data-driven frameworks like Spark, Presto, etc. And we're just starting to see that people are trying them out, POC-ing them; in early stages, some are in production, but it's still early. And I think that Kubernetes obviously helps with getting that elasticity on prem, and it simplifies that problem. But at the end of the day, you still need the hardware to run on, right? And so you will get a lot more flexibility, but the flexibility of the cloud, in terms of having unlimited compute if you need it, still sets the cloud apart. In terms of object storage, we definitely see a rise in the object store. S3 made object stores popular, whether they're in the cloud, especially in the cloud, or on prem. Particularly as Hadoop gets disaggregated even more,
HDFS workloads will slowly start moving towards object stores like MinIO and others as well; there are several out there. And the same thing will apply: you would need to kind of restore data locality on top of these object stores, which aren't really built for advanced analytics. And we see that as an upcoming, or sort of emerging, use case that will come about maybe six months to a year down the line and become more prominent.
Tobias Macey
0:23:33
So you mentioned the specific instance of the work that DBS Bank is doing, but I'm wondering if there are any other common patterns that you've seen for cloud migration projects in the data and analytics space and some of the challenges that they present. And then also, I'm wondering if you have any advice on useful metrics for being able to track and determine the overall completion and success criteria for the project?
Unknown
0:24:01
Yeah. In terms of migration, we've seen a couple of different approaches, if you have if the mindset is that the cloud initiatives in the enterprise are the, you know, Paramount initiative in the sense of like, there's a immediate urgency to move to the cloud, really the, the, the only approach that that works is, you know, moving, migrating the entire data set to the cloud and then running, running workloads in the cloud, right. But it's a it is a non trivial, it's a non trivial kind of approach. Not everyone has the ability to run both on prem and the cloud. At the same time, it obviously doubles costs. But if that is an option, then we have you know, there's there are companies that just go about that that way, most of the companies most of the enterprises, the way they the way they think through it is a workload by workload, right? And they start off With a low risk workloads, and it, these are workloads that don't need that that are, that might be ephemeral. So for example, you just run the workload and you know, it's done and it's gone. They start off with the ephemeral workloads, then they start off with the the more scheduled workloads that might be adding additional capacity additional overhead at a certain time of the month, for example, at the end of the month, end of the quarter, there, there might be a lot more reporting, jobs that run and so those can be methodically moved over to the to the cloud, their data sets can be moved over to the cloud. Now, if there is a huge overlap in the data sets across all these workloads, then it becomes harder, and that's where this the the zero copy burst or the the bursting to the cloud would make would make more sense. And so depends on you know, these are the different approaches. It depends on the the mindset of the company. the urgency of moving to the cloud, the goal of it as well as the appetite, right? 
In some ways, there have been some breaches and so on, which have caused people to think about whether this is the right thing to do, especially with their enterprise data. So what we hear is that some data will continue to live on prem and will never be moved to the cloud. A lot of data will move to the cloud, but there will be an almost IP-like category of data which is so precious that it will continue to remain on prem. The other part of the question you asked is metrics: how do you gauge success or progress through this process? At the highest level, there is the size of the total data set. But then at the next level of granularity it's the number of tables you're managing, and you can look at that on a business unit basis, because invariably this data is split across finance, marketing, product, and so on, even though all this data is in a single data lake or in a few different data lakes. So you can go organization by organization, and within the organization it's the number of tables that have been migrated or not, and then which reports are associated with those tables, so that you're never in a situation where a report or a query comes in and the data set is not available. There might be a period of time where you have to have both available and you're playing catch-up. Then at one point you say no more new data gets generated on prem, you start adding the new data in the cloud, and that workload permanently gets run in the cloud.
So those are some of the things that we hear as we talk to users and try to identify which workloads they would like to burst. The easiest to begin with are ephemeral Spark jobs, particularly for machine learning: modeling, regression analysis, Monte Carlo analysis. Those kinds are relatively easy, and then you move on from that point.
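The progress metrics described in this answer (total data set size, tables migrated per business unit, and which reports depend on which tables) could be tracked with something like the following minimal sketch. The table names, sizes, and the `Table` and `Report` types are hypothetical and only illustrate the bookkeeping; they are not taken from any tool mentioned in the episode.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Table:
    name: str
    business_unit: str
    size_gb: float
    migrated: bool = False


@dataclass
class Report:
    name: str
    tables: list[str]  # names of the tables this report reads


def progress_by_unit(tables):
    """Return {business_unit: (migrated_table_count, total_table_count)}."""
    out = {}
    for t in tables:
        done, total = out.get(t.business_unit, (0, 0))
        out[t.business_unit] = (done + (1 if t.migrated else 0), total + 1)
    return out


def migrated_fraction_by_size(tables):
    """Fraction of the total data set size that already lives in the cloud."""
    total = sum(t.size_gb for t in tables)
    moved = sum(t.size_gb for t in tables if t.migrated)
    return moved / total if total else 0.0


def cloud_ready_reports(reports, tables):
    """Reports whose every underlying table has been migrated."""
    migrated = {t.name for t in tables if t.migrated}
    return [r.name for r in reports if all(n in migrated for n in r.tables)]


# Hypothetical inventory, for illustration only.
tables = [
    Table("ledger", "finance", 400, migrated=True),
    Table("invoices", "finance", 100, migrated=False),
    Table("campaigns", "marketing", 250, migrated=True),
    Table("clicks", "marketing", 250, migrated=True),
]
reports = [
    Report("month_end_close", ["ledger", "invoices"]),
    Report("campaign_roi", ["campaigns", "clicks"]),
]

print(progress_by_unit(tables))
print(migrated_fraction_by_size(tables))
print(cloud_ready_reports(reports, tables))
```

Tying reports to tables is what guards against the situation described above, where a query arrives and its data set is not yet available in the cloud: a report only counts as ready once all of its tables have moved.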
Tobias Macey
0:28:17
One of the other things that plays into some complexities of these migration efforts is that the cloud inherently requires a different approach to how you design and implement infrastructure and applications because of the fact that a lot of the resources are ephemeral. And there are no guarantees for durability, unless you're talking about the storage layer in terms of like object storage or block storage. So I'm wondering what you have seen as far as pain points for companies who are going through these projects and trying to work on hybrid implementations or cloud migrations and some of the ways that they approach employee education for being able to design and implement effective systems for these types of migrations and goals that they have.
Unknown
0:29:05
Yeah, I think that in most scenarios they do it use case by use case. In the case of DBS Bank, which we talked about, it was the call center project, because they knew it would probably need a significant amount of compute resources to run these workloads, and it might be non-trivial. It affects the on-premise systems if you suddenly add a workload that's unpredictable. So workloads that are unpredictable, where you don't necessarily know how much I/O overhead or CPU overhead they're going to have on your existing system, become the first category that they will start trying out. And then the program begins from there, with new applications. In many cases these are really not allowed to be added to the existing on-prem Hadoop cluster. They just say: this is on a specific data set, we don't have any additional compute capacity, so all new workloads that satisfy this set of requirements will be run in the cloud. So that's the first step: new workloads and new jobs go to the cloud. Moving the rest of it is the harder part. For that, you've got to have a very thoughtful migration program, as we talked about earlier, to create a process that goes organization by organization, and even run POCs to see the implications. As a platform organization, you're providing a service to other organizations, and because of that you have to ensure that when your environment changes, it's going to be at least as good. So performance becomes a criterion, features become a criterion, and the level of security becomes a criterion.
And so based on that, you create the migration program and figure out what can be burst immediately and what needs to be physically moved over with more of a lift-and-shift approach.
Tobias Macey
0:31:26
We've highlighted a few different tools in the data orchestration space in the form of Alluxio, Presto, Spark, etc. But I'm wondering what you see as being some of the missing pieces or gaps in the landscape where there's an opportunity for either extending some of the existing tools or building a new tool or platform to fill that particular gap, and any efforts that you're seeing on those fronts?
Unknown
0:31:53
Yeah, good question. As a product person, I'm constantly looking for exactly this: where are the gaps, and how can we fill them? Obviously from an Alluxio perspective, but also more broadly than that. What we're seeing is that in the world of advanced analytics, in the Hadoop world, everything moved to files. Everything is a Parquet file or an ORC file. It's a file-based approach as opposed to a table-based approach. What this means is that you're dependent on the Hive Metastore, or catalogs like that, to be able to really do anything. Presto, for example, needs it, and Spark SQL would need it. So that's an interesting space. There are projects emerging, like Iceberg and so on, where you can actually get the schema out of your Parquet files and do things more efficiently and optimize. I think that space is ripe and it will evolve, and we're thinking about that as well. In fact, we might have some fun announcements at our Data Orchestration Summit that we're hosting on November 7, so that might be one of the areas. The other area I see that's interesting is data transformation. So one is catalog services: how do you manage structured metadata in a land of files and objects? The second one is more on transformation. Up until now, when you moved data from an operational system, your OLTP system, you would go through this ETL process, prep it, convert it to a star schema, do the optimizations beforehand, and then land it in your data warehouse, and then you would just query it.
That was the holy grail: the star schema, the snowflake schema, and so on. Now what's happening is there's so much data that it's just landing in the data warehouse or the data lake. Instead of spending time optimizing before it lands, you might want to do some event processing, like Kafka-style event-based analysis, beforehand. But the optimization of how the data is stored is no longer as important, because you're not depending on a star schema as such anymore. So the optimization and the transformation need to happen after it lands. What we see is that you will actually optimize on the way to compute: when compute needs the data, that's when it needs to be optimized, on demand. If it could be in a better file format, you convert it to a better file format. If it needs to be compressed, you can compress it. If it needs to be coalesced, you can coalesce all the files. So the prep and the transformation will actually happen on demand as you're consuming it. I think there are interesting things that can be done in that space that we're looking at as well. But beyond Alluxio, I think there will be a lot more optimization that happens for the compute from a performance perspective, because every compute is different, and every compute favors different formats, different approaches, compression techniques, and so on. So I think that's another space where we will see a lot more automation: less manual work, a lot more automation, and more tools will come up.
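The idea of optimizing "on the way to compute" can be pictured as a small planner that looks at the files a job is about to read and decides which of the three steps named above (convert, compress, coalesce) to apply for that particular engine. This is only a sketch under stated assumptions: `FileInfo`, `EngineProfile`, the preferred formats, and the size threshold are all hypothetical and do not describe how Alluxio or any other tool actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    fmt: str          # e.g. "csv", "parquet", "orc"
    compressed: bool
    size_mb: float


@dataclass(frozen=True)
class EngineProfile:
    name: str
    preferred_fmt: str   # assumed preference, not a documented fact
    min_file_mb: float   # files below this size are candidates for coalescing


def plan_on_demand(files, engine):
    """Return the ordered list of steps to run before `engine` reads `files`."""
    steps = []
    for f in files:
        if f.fmt != engine.preferred_fmt:
            steps.append(("convert", f.path, engine.preferred_fmt))
        if not f.compressed:
            steps.append(("compress", f.path))
    # Coalescing only makes sense when there are at least two small files.
    small = [f.path for f in files if f.size_mb < engine.min_file_mb]
    if len(small) > 1:
        steps.append(("coalesce", tuple(small)))
    return steps


# Hypothetical inputs, for illustration only.
files = [
    FileInfo("events/part-0.csv", "csv", False, 4),
    FileInfo("events/part-1.csv", "csv", False, 6),
    FileInfo("events/part-2.parquet", "parquet", True, 512),
]
parquet_engine = EngineProfile("hypothetical-sql-engine", "parquet", 64)
orc_engine = EngineProfile("hypothetical-batch-engine", "orc", 64)

for step in plan_on_demand(files, parquet_engine):
    print(step)
```

Running the same files through `orc_engine` yields a different plan, which is the point made above that every compute favors different formats: the work is decided per consumer at read time rather than once before the data lands.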
Tobias Macey
0:35:30
And what are some of the overall industry and market trends that are both driving the overall data orchestration movement, as well as some of the ways that data orchestration fits into those broader ideas? And what are some of the particular projects or trends that you are personally most excited by?
Unknown
0:35:51
Oh, yeah. As we started off the conversation, we talked about how the stack has become more disaggregated. So the trend of separated storage and compute, the disaggregated stack, I think that will continue. There might be some tightly, vertically integrated systems, but a large majority of platform teams will use disaggregated architectures, and that's certainly pushing and driving data orchestration. Another one along those lines is self-service data: making data analysts and data scientists a lot more efficient by giving them access to the data whenever and wherever they need it. I think that's another one that's helping, and vice versa: data orchestration enables that, and the trend also helps make data orchestration more real. And then the other question you asked was around what I'm most interested in or excited about, the trends and so on.
Dipti Borkar
0:36:57
I think that the data revolution is just at its beginning. We've gone through the Hadoop ecosystem and many iterations of it, where we went from MapReduce to Hive to Spark SQL to Presto, and there will be constant innovation on the framework side. I think that moving to the cloud, with the resources available in the cloud and the flexibility of the cloud, will enable a new set of innovations to get insights from data significantly faster. Because of these operational efficiencies, time to value and time to insight will be significantly faster. It could be using GPUs more efficiently and at scale. It could be query engines that are highly distributed at scale. I think this entire data ecosystem will continue to emerge as it has. If you look back, you still see that mainframes are still around and Teradata is still around, but at the same time the amount of data and the capabilities that data has are just exploding. So I continue to be very excited about this space, and in some ways I continue living up to my initials.
Tobias Macey
0:38:28
So for somebody who wants to learn more about the overall space of data orchestration and the benefits that it can provide, and some of the overall industry trends that it's driving and that are driving it, what are some of the resources that you would recommend?
Unknown
0:38:43
Yeah, absolutely. Alluxio is an open source project, so we have a lot of information openly available on our site, and the Community Edition is free to download. We are also putting together a Data Orchestration Summit that is planned for November 7, and it's a great lineup of speakers. We have thought leaders from Netflix, O'Reilly, DBS Bank, and Walmart coming in to present. And we actually have a special offer as well, I think: you can use the code "podcast" for 25% off registration. The Data Orchestration Summit will help you understand not just a little bit more about Alluxio, but also the other data engineering tools that are out there that can improve efficiency and help these unsung heroes that are the data engineers.
Tobias Macey
0:39:48
And are there any other aspects of data orchestration or hybrid cloud migration projects that we didn't discuss yet that you'd like to cover before we close out the show?
Unknown
0:39:58
I think we talked about a lot. So yeah, it was a great conversation. We have a lot of information on our website and so on, and personally I'm looking forward to seeing folks at the summit. So thanks so much for having me on, Tobias.
Tobias Macey
And for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. As a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Dipti Borkar
All right, the hardest question at the end. I think that there are many gaps, and we talked about some of them. I think metadata management in general, across structured and unstructured data, is something that needs to be tackled a lot more. At the end of the day, the value and the insight of the data is in the metadata. A lot more lineage, tracking, and real leveraging of metadata can be done across the enterprise. It's a very hard problem to solve, and it's being solved in silos today. It'll be interesting to see how that space emerges and that gap closes, because at the end of the day, to access data and to manage data, you have to have access to the metadata first. I think that's something that will need to be solved in the short term. And there are obviously other gaps out there that will emerge, and innovation will continue.
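As a rough illustration of what lineage metadata buys you, the sketch below records, for each data set, the data sets it was derived from, and then walks that graph to find everything a report ultimately depends on. The data set names and the plain-dictionary representation are hypothetical; real metadata systems store far more than this.

```python
def upstream_sources(lineage, dataset):
    """Return every data set that `dataset` directly or transitively derives from."""
    seen = set()
    stack = list(lineage.get(dataset, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, ()))
    return seen


# Hypothetical lineage: each key is derived from the data sets in its list.
lineage = {
    "churn_report": ["customer_features"],
    "customer_features": ["crm_export.parquet", "call_center_logs"],
    "call_center_logs": ["raw_audio_transcripts"],
}

print(sorted(upstream_sources(lineage, "churn_report")))
```

When this information lives in separate silos, a question like "which raw sources feed this report?" cannot be answered in one place, which is the gap described in the answer above.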
Tobias Macey
0:41:42
Well, thank you very much for taking the time today to join me and share your thoughts on the space of data orchestration and some of the ways that it can have some concrete benefits. So thank you for all of your time and effort on that and I hope you enjoy the rest of your day.
Dipti Borkar
0:41:56
Absolutely. Thanks so much. Bye bye.
Tobias Macey
0:42:04
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com to learn about the Python language, its community, and the innovative ways that it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.