Scaling Data Governance For Global Businesses With A Data Hub Architecture - Episode 123

Summary

Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control with inheritance of higher level controls it reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the goals of a data hub architecture?
  • What are the elements of a data hub architecture and how do they contribute to the overall goals?
    • What are some of the patterns or reference architectures that you drew on to develop this approach?
  • What are some signs that an organization should implement a data hub architecture?
  • What is the migration path for an organization who has an existing data platform but needs to scale their governance and localize storage and access?
  • What are the features or attributes of an individual hub that allow for them to be interconnected?
    • What is the interface presented between hubs to allow for accessing information across these localized repositories?
  • What is the process for adding a new hub and making it discoverable across the organization?
  • How is discoverability of data managed within and between hubs?
  • If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
    • If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
    • How are access controls and data masking managed to ensure that various compliance regimes are honored?
    • In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
  • Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record how do you handle transformations of data as it traverses different hubs?
    • How do you address issues of data loss or corruption within those transformations?
  • How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
  • How do you manage tracking and reporting of data lineage within and across hubs?
  • For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
    • What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
      • Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
  • When is a data hub architecture the wrong approach?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Tim Ward about using an architectural pattern called data hubs that allows for scaling data management across global businesses. So Tim, can you start by introducing yourself?
Tim Ward
0:01:35
Sure. My name is Tim Ward, and I have had the pleasure of being on this podcast before. My background is mainly in software engineering, and over the last six years I've been focusing more on data engineering. So combined, that's about 14 years of working in the software space. I am based out of Copenhagen, Denmark. I have my wife, I have a little boy called Finn, and I have a dog that looks like Chewbacca that's called Seymour. So for those fans of Futurama, you'll know where the reference for that dog is, and it looks exactly like Seymour out of the Futurama episodes. So that's me.
Tobias Macey
0:02:15
And so you mentioned that you've only recently been involved in data management. So wondering if you can just give a bit of a background in terms of how you got involved in that space?
Tim Ward
0:02:26
Yeah. So at a previous job, I was working for quite a large vendor. It wasn't in the data space. But it turned out that the last project that I worked on there was, funnily enough, how could we integrate our customers' third party systems into the particular platform that we were building. And it turned out that the majority of the team I was working with were all tasked with a very similar project, but with different customers. They obviously had different stacks and tools per customer. But essentially, we went down that rabbit hole of: okay, so it looks like we need to buy a couple of different pieces here. We need a data integration piece. Okay, I've heard from one of my other colleagues that they're investing in a data governance platform. Yep, that makes sense for my customer as well. And after you go down that rabbit hole of the umbrella of data management, you've got potentially 10 different products that you've got to stitch together. And what got me into data management was the trend across every single one of these projects, which, by the way, didn't go so well. Where they failed was actually stitching these different products together. In some cases they were from different vendors. Not with the particular customer I was working with, but some of my colleagues were having the same struggles even with the same vendor's products, just identifying and going after the different pillars of data management. So I think what got me into this was the idea that all of these didn't go well, so maybe there's something to this. Maybe there's a need to investigate more modern techniques for data management, because it didn't seem like this was a common request for most enterprise sized businesses anyway.
Tobias Macey
0:04:23
And so as I mentioned at the beginning, we're talking about some of the work that you've done in terms of using this data hub architecture. And I'm wondering if you can just start by giving a bit of an overview about the goals of what that architectural pattern is intended to achieve?
Tim Ward
0:04:40
Yeah, sure. I think I'll start with the fact that there are multiple different interpretations of what a data hub is. If you go to, for example, some of the analyst firms in our space, the way that they've interpreted it is by making an analogy to, say, the subways in London, where you've got stations that take people from one side of London to the other, and the stops along the way where people can get off. Essentially there's this whole spaghetti of hubs going around London to take people from one place to the other. And really what those analyst firms are saying is that you can apply the same thing to data. Instead of transporting people, you're taking data from a source system and making it available in some type of target system: for example, your BI platform of choice, maybe your data science platform of choice, you name the tool, but wherever you typically want to do something with the data once it's prepared. Now, our interpretation of a data hub is still similar, but I would say it has its differences. The goals of the data hub architecture came from a simple question that we had from one of our customers, which was: we are a globally distributed business. We have a headquarters, we have regional offices, we have localized offices. And this is probably a similar case with the majority of enterprise businesses, especially those that are globally distributed.
And the simple question was: I need to put in a data governance strategy, and how on earth am I going to manage, from a top down approach, all the possible permutations of policies and rules and regulations for localized or regional offices, and also take care of the headquarters as well? And as you're probably aware, Tobias, what we typically do in the engineering field when we have a big problem like that is look at how we break it down into smaller challenges. And when you do that, you've got to have some way to instrument it, orchestrate it, and bring it together once those components have actually been broken down into smaller chunks. So our interpretation, the way that we've been working with the data hub architecture with our customers and with our platform, has really been focused around: even for one of the pillars of data management, like data governance, how are you going to manage this from a top down approach with what could be localized data policies on what data they are allowed to have, and localized retention policies on how long they can have it for? And then you start to add in the other pillars of data management, like data quality, where the quality levels or metrics or KPIs for one region might be different to the other. This is what forced us to explore a new architecture to support this complexity of globalized rules.
Tobias Macey
0:07:59
In terms of the actual architecture itself, what are some of the core elements and abstractions that are incorporated into it to allow for this global distribution of data with localized control? And what are some of the other patterns or reference architectures that you drew on in order to arrive at this approach?
Tim Ward
0:08:22
Yeah. So our team has been working with many different types of technologies. One that we've been focusing on, and that we use quite heavily within our platform, is a graph database, so really this concept of the network. And I think we've been spoiled by working with it, because essentially it offers us the highest fidelity data structure that we have to be able to map data. Playing with that gave us some inspiration on what the elements of this hub architecture are. The first obvious one would be your hubs. In a graph world, these would be nodes. Those hubs are essentially responsible for managing, as you put it, the localized rules. They're not interested in knowing: what does global think about this? What does that other local hub think about this? What about my parent hubs, like a region hub? No, they want to manage the specific rules for their location. And when you have these hubs, you obviously need some way for hubs to talk to each other, and that's where this element of a hub orchestrator comes in. It's probably not too hard to see this analogy played out in other areas. A good example would be that we use Kubernetes to orchestrate our containers, and in the Apache or open source world we often use ZooKeeper to orchestrate our different data stores, or to keep different environments in some type of sync. So there are many other places where we use this type of architecture pattern, or at least where it's available to us today. So if you've got these hubs, the hub orchestrator is responsible for: okay, how do I talk from one hub to another? And there's a really good analogy I like to use here, which is travel. When I travel from Copenhagen to the US, there is a border set up, which essentially says: for you to enter, you need to do these things.
Well, first of all, you need a valid passport. You potentially need a visa, or a work visa, for example, depending on what you're doing. And you're not allowed to be on any of the blacklists or embargo lists that we have in our country. And then if you check all those boxes, feel free to come in. You can really start to apply that same kind of concept, but replace people with data. So how does data travel from one hub to another? Well, the localized hub sets up rules, such as: if I'm going to talk with another hub, I need to make sure that the completeness of records is at a certain level, let's say 90% as an example. So if you want to share your records with my hub, you need to make sure your records are 90% complete. And if they're not, the hub orchestrator is responsible for saying: I'll give you tools to elevate that level of completeness. Whether that's plugging in enrichment services, for example, if we're missing addresses, we might plug in something like the Google Places API; if we're missing company data, we might plug in Dun and Bradstreet or OpenCorporates or, you name it, the local business registries that exist in the majority of countries. But I'm going to give you that tooling so that you can enter. And so that type of pattern was a huge inspiration and reference, as an architecture that, I guess I can't use the word nature, but is already working in another part of the world. Now, whether you'd like to think it works or not, that's completely up to you. I'm sure many people will complain about travel into certain countries and how hard it is and how often they're first. And I'm not speaking from experience. But essentially, the concept works.
And the great thing is that often when you bring in concepts from the real world and then say, okay, let's put in a machine to handle this, a lot of the issues that you find with, for example, scaling humans to solve a problem can actually be removed when we move it into a much more technical architecture. So those are some of the patterns that we drew upon to develop this approach. But let's be honest, I've already mentioned two examples where orchestration is common and that pretty much most enterprises are using. That's Kubernetes, to orchestrate deployment of containers, scalability, health checks, and liveness checks. And then you've got tools like ZooKeeper that have been around for years and years to orchestrate syncing of environments, syncing of dictionaries and configuration. So it's not a far stretch to see how this pattern really came together.
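The border-control idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `EntryRule`, `check_entry`, and `enrich` are hypothetical names, the `lookup` callable stands in for whatever enrichment service a hub might plug in, and the 0.9 threshold is the 90% figure used as an example in the conversation.

```python
from dataclasses import dataclass


@dataclass
class EntryRule:
    """Hypothetical rule a receiving hub publishes for incoming records."""
    required_fields: list[str]
    min_completeness: float = 0.9  # the 90% example from the interview


def completeness(record: dict, required_fields: list[str]) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in required_fields if record.get(f) not in (None, ""))
    return filled / len(required_fields)


def check_entry(record: dict, rule: EntryRule) -> tuple[bool, list[str]]:
    """Return (admitted, missing_fields) for one record at the hub border."""
    missing = [f for f in rule.required_fields if record.get(f) in (None, "")]
    admitted = completeness(record, rule.required_fields) >= rule.min_completeness
    return admitted, missing


def enrich(record: dict, missing: list[str], lookup) -> dict:
    """Fill missing fields using a caller-supplied lookup (assumed service)."""
    enriched = dict(record)  # leave the sending hub's record untouched
    for name in missing:
        value = lookup(name, record)
        if value is not None:
            enriched[name] = value
    return enriched
```

The rule belongs to the receiving hub only, so a sender never needs to know the rules of any hub other than the one it is trying to enter; a rejected record comes back with the list of fields to fix rather than a bare refusal.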
Tobias Macey
0:13:57
And in terms of determining when it's worth going down this path of establishing this architecture of these various hubs and the orchestration between them and the rules that exist for determining what data is allowed to egress and in what fashion, what are some of the signs that an organization should actually go down that road and pay the complexity cost of actually implementing that and adding all these different constraints to their different localized data regimes?
Tim Ward
0:14:27
Yeah, that's a really good way to put it. Because one of the side effects, or I guess potentially a disadvantage, of an architecture like this is: are you over engineering for the situation? One of the big signs immediately would be: are you even a globally distributed business? You don't have to set up your hubs in a geographical architecture. It can break down by different components, but essentially that's the commonality that we typically see with our customers. So the first thing you would need to ask yourself is: are you a globally distributed business? The reason why I use geography as an example is that often, when we're talking about banks and insurance companies, they're regulated differently in different countries and locations. Therefore, the rules around what data they can have and how long they can retain it for are a good common case for setting up this type of architecture. The next thing you've got to ask yourself is: are you actually an operationally complex business? If not, this, I would argue, is overkill. But it doesn't mean that we should necessarily shut off the option of growing your business. I guess that's really the goal of a lot of businesses: I want to expand into other locations. And typically that comes with complexity. So I want an architecture pattern where, upfront, I don't need to subscribe to the complexity, but I also don't want to shut off that path. And the data hub architecture, I would argue, has this designed into it. In fact, you can imagine that a globally distributed business is growing into new locations quite often, whether this is through mergers and acquisitions or just natural growth of the business.
And therefore the data hub architecture is not only designed to help with this, but what we did not want is, if we've got 300 of these hubs, as soon as we add another hub, to have to say: ah, that's n factorial links that I have to sync up on rules. And you can imagine the complexity of how these rules overlap and whether they contradict each other. That's just never going to work. So with the data hub architecture, and I think it's the use of the graph network model that helped with this, you can start small. You can start with one hub, the global headquarters, and as you grow out, you might say: okay, we just have an office in London. I don't want to take on the infrastructure costs of setting up new technology and new infrastructure in a new data center or in a new region in our cloud provider. Let's just shove that all within the headquarters. Now, of course, what that means is that once you do move to that hub architecture, the complexity would be splitting that up, and that can be complex. So really, I guess the sign that you would need to look at is growth. Do you see it in the foreseeable future? And if so, maybe start out with the concept of not closing yourself off from an architecture like this.
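To make the link-count concern concrete, here is a small sketch comparing a fully interconnected set of hubs with a parent-child hierarchy. It assumes a plain pairwise count, n(n-1)/2, for the fully connected case and one parent link per non-root hub for the hierarchy. Both function names are illustrative.

```python
def full_mesh_links(n: int) -> int:
    """Number of hub-to-hub rule relationships if every hub syncs with every other."""
    return n * (n - 1) // 2


def hierarchy_links(n: int) -> int:
    """Number of relationships if each hub only answers to one parent hub."""
    return max(n - 1, 0)


if __name__ == "__main__":
    for hubs in (1, 10, 300, 301):
        print(hubs, full_mesh_links(hubs), hierarchy_links(hubs))
```

Under these assumptions, 300 fully interconnected hubs already carry 44,850 pairwise relationships, and adding one more hub adds another 300. In the hierarchy the same 300 hubs carry 299 parent links, and a new hub adds exactly one.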
Tobias Macey
0:17:53
And for an organization that already has an existing data platform, what are some of the migration steps that they might look to for being able to move into a data hub architecture and open up their data to be integrated or distributed across these localized hubs? And then also, one of the other possible uses for a data hub architecture that I can envision is if there's some sort of a data collective, being able to share data across organizations or open up certain datasets for public access.
Tim Ward
0:18:28
Yeah, good question. This question reminds me of another thing that we've done in software engineering over the last few years, which is moving off the monolith to microservices. And what did we do? What did every business do? Essentially, we said: okay, let's break up the monolith. Let's break it up into components, but then I need to put something in between as a kind of messenger. One of the examples that was very common was putting in an enterprise service bus, where we pass around discrete, small messages so that these different components can talk to each other in a somewhat decoupled way. And really, this is similar to what the migration path looks like for moving off a classic data platform into more of a distributed governance environment. And I'll just reiterate that governance is just one of those pillars, right? I think the reason why I'm choosing it is that putting governance strategies into the business is usually one of the first things that people want to do. And you can also imagine that this can get really complex as I add the extra pillars of data management in. But essentially, the experience with the migration, as I alluded to before, is a gradual move. It's really saying: okay, how are we going to start splitting up this monolith? Typically the way we did it in software engineering was to say: let's identify the main components of the platform. So we would split logging out to its own service, we would maybe split jobs out to its own service, we would potentially split something like data access out to its own service. Now, other architectures in microservices would say: no, no, split the service out to have all of its components sitting behind it.
So logging will have a database and it'll have a data layer, because that's all it's responsible for. Now, whichever way you did it, it's still going to help in this example. But really, the approach that we've seen our customers take, because the majority of our customers started out with the monolith, is that they said: okay, we've got one central hub of data, and it's got all of our data in one place. We've got all the governance rules in there. Oh God, I didn't realize how complex the governance rules were across the different businesses. There has to be some way to split this up. So I would say that if you've broken up a monolith before, the migration path is very similar to that type of situation. Now, to answer the second part, which is about making this data available to the rest of the business, whether just internally or potentially to the public: once again, there are multiple parts to that. Governance plays into it, of course, and access control and things like that play into it as well. But what the hub architecture basically does, like it does with the other components, is take what is typically a top down approach, which is: how, from my white ivory tower, do I set the overarching rules for my global business on how we share data? Now, one of the beauties of the hub architecture is that it instinctively has a hierarchy within it. A graph is just a higher fidelity version of a tree. So, for example, I might want that top down approach where global says: listen, we have a global policy on sharing data, that we can't give out anything that's personal. The hub orchestrator can then be responsible for saying: ah, got it. So in fact, there are some policies that the local hubs don't have any control over. They've inherited these rules from the global headquarters or from the regional hub, and they're enforced by default on the localized hub.
Now, whether that localized hub can then say, I'd like to override those rules, is the responsibility of the hub architecture: to know if that's even allowed or not. But what this means is that at least the higher level parents can say: listen, you might have some extra ways that you want to share data in your local hub, but as a global business, we're handing down these rules. Whether you've got rules to add on top of that, or to override that, it's up to the hub orchestrator to figure out whether those rules are actually compliant with each other or not.
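The inheritance-with-override behaviour described here can be illustrated with a small tree of hubs. This is a minimal sketch, not the speaker's implementation: the `Hub` and `Policy` classes, the `overridable` flag, and the policy names are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    value: object
    overridable: bool = True  # False: child hubs must inherit it unchanged


@dataclass
class Hub:
    name: str
    parent: Optional["Hub"] = None
    policies: dict = field(default_factory=dict)  # policy name -> Policy

    def effective_policies(self) -> dict:
        """Resolve policies from the root down, refusing overrides of locked ones."""
        inherited = self.parent.effective_policies() if self.parent else {}
        resolved = dict(inherited)
        for key, policy in self.policies.items():
            parent_policy = inherited.get(key)
            if parent_policy is not None and not parent_policy.overridable:
                raise ValueError(
                    f"{self.name} may not override locked policy {key!r}"
                )
            resolved[key] = policy
        return resolved


global_hub = Hub("global", policies={
    "share_personal_data": Policy(False, overridable=False),
    "retention_days": Policy(365),
})
emea = Hub("emea", parent=global_hub)
london = Hub("london", parent=emea, policies={"retention_days": Policy(90)})
```

Here `london` inherits the locked ban on sharing personal data through `emea` and is allowed to shorten retention locally. A hub that tried to set `share_personal_data` itself would be rejected when its policies are resolved, which is the compliance check the conversation assigns to the hub orchestrator.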
Tobias Macey
0:23:21
And one of the other interesting things in this architectural approach is understanding what the useful topologies are. Is it something where you would have a flat network, where you have maybe every hub interconnected with each other, or more of a hub and spoke model, or something more along the lines of a DAG, where you might have different levels of linking between different nodes and maybe self contained sub networks that all interact with a hub network, and how the overall transit across those different hubs works? And particularly, if you have a complex topology where there might be multiple different nodes of traversal, how does that work into things like latency and issues of being able to discover and query across all of these different data repositories?
Tim Ward
0:24:15
It's one of the things that often comes up when we separate data out. One of the ironies here is that you could argue: did we just bring all the data in to then just silo it again? And of course, that's not the intention. In fact, the hub orchestrator really acts as a global journal of what data is available across the entire network. And then, of course, it will give you control in saying: hub one and hub two, they're allowed access to this, but if dictated, I can inform them that there are these other datasets that are available, and all they need to do is ask to be able to get access to that data. So essentially, the hub orchestrator is like this transaction log or journal: I'm writing everything about every hub, and I'm going to orchestrate who has access to what, as you mentioned, in what hub, but also potentially in what subtrees of that network. And so some of these architectures, or at least these relationships, can get pretty complex. What we're really recommending is to use the analogy I mentioned before of countries and cities and states, and use that to build up this network. The reason why this works is that typically that's how a lot of businesses are distributed as well, via locations, and therefore it makes it a little bit easier to manage. But I won't take away from the fact that, yeah, these architectures could get quite complex in how these relationships work and how data is shared between them. So it's definitely, I would say, one of the points of discovery with our customers: how far are they taking this, and where is the complexity actually starting to arise, where upfront we wouldn't have realized that these kinds of things would come up.
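A rough sketch of the orchestrator acting as a journal of datasets and access might look like the following. The `Orchestrator` class and its method names are hypothetical, and the in-memory dictionaries stand in for whatever store a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    """Hypothetical journal of which hub owns which dataset and who may read it."""
    owners: dict = field(default_factory=dict)   # dataset -> owning hub
    grants: dict = field(default_factory=dict)   # dataset -> set of hubs with access
    journal: list = field(default_factory=list)  # append-only log of events

    def register(self, hub: str, dataset: str) -> None:
        """Record that a hub publishes a dataset; the owner can always read it."""
        self.owners[dataset] = hub
        self.grants[dataset] = {hub}
        self.journal.append(("register", hub, dataset))

    def grant(self, dataset: str, hub: str) -> None:
        """Give another hub access to an already registered dataset."""
        self.grants[dataset].add(hub)
        self.journal.append(("grant", hub, dataset))

    def discover(self, hub: str) -> dict:
        """List every dataset in the network, flagged by whether `hub` can read it."""
        return {ds: hub in readers for ds, readers in self.grants.items()}
```

A hub calling `discover` sees that a dataset exists even when it has no access yet, which matches the point that hubs can be told other datasets are available and only need to ask for access. The data itself never has to leave the owning hub for it to be discoverable.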
Tobias Macey
0:26:22
And then another issue, particularly if you have a deeply layered topology is how you handle the transformations between hubs where they have different rules in terms of how the records should be represented or data quality or cleanliness issues, and being able to handle issues of potential data loss across those different nodes.
Tim Ward
0:26:45
Yeah, exactly. So I'll give you a couple of good examples. So and, you know, using that analogy before of like border control, it's a good way to kind of conceptualize what's happening here. So each hub is responsible for saying If you're going to talk to me, you need to have this checklist, right? Not the other way around. So it's not responsible for saying, hey, if I give my data to you, it needs to meet these standards. No, that's the part where it wouldn't scale because suddenly, you're taking a top down approach where you're saying, I'm now responsible for my own hub, but also every other hub in the network as well. So when it comes to things like data quality levels, when it comes to things like transforms, let's just take a simple example. You've got one hub that says, hey, my data that I have, it looks like this. It's a nested JSON object. And I'm, I've been instructed from the hub orchestrator to take data from London hub, and move it to or transfer it to the global hub. And so the global hub has a rule that says, No, no, I need flat data. That's what I'm wanting and the kind of, for lack of a better word, but the names of things properties or columns or attributes, whatever you want to call them, I need them to be called in a specific way, right? That's up for the hub orchestrated to figure out now, figuring that out is not the hardest part because essentially what you're doing is you're on entry or potential entry into your target, you're just analyzing a list of rules that are only local to that hub. So from a management perspective, you don't need to know about the whole world. But what you do need to do is then give those hub owners tools to say, Well, I'm not here to just tell you that we can't talk. It's my goal to give you tools to address that issue. And whether it's giving them tools to say, I'll give you a tool that transforms JSON, or nested JSON objects into flat representations. And maybe that's one approach. 
And I'll give you an example with the analogy I'm using. If I want to travel into the US, it's not like I get there and someone says, "I can't help you. I don't know what to do for you to enter this country." They say, "Here is a form, fill it out, and that's how you get into this country." So really, it's about the hub orchestrator identifying that there's an issue, there's a clash, we can't talk. And then there's the role of what maybe you could call a facilitator, which says, "And here is the form you need to fill in so that we can talk."
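As a rough illustration of the London-to-global example above, here is a minimal Python sketch. It assumes the global hub's entry rules are "flat records with specific property names" and that the "form" handed to the London hub is a simple rename mapping. The rule set, the mapping, the field names, and the sample record are all hypothetical; the conversation does not describe a concrete rule format.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


# Hypothetical entry rules, owned by the global hub and local to it.
GLOBAL_HUB_RULES = {"required_names": {"full_name", "job_title", "city"}}

# Hypothetical "form" for the London hub: flattened source path -> expected name.
LONDON_TO_GLOBAL = {
    "person.name": "full_name",
    "person.job.title": "job_title",
    "address.city": "city",
}


def violations(record: dict[str, Any], rules: dict[str, Any]) -> list[str]:
    """Check a record against the target hub's rules only."""
    problems = []
    if any(isinstance(value, dict) for value in record.values()):
        problems.append("record is nested; flat data required")
    missing = rules["required_names"] - set(record)
    if missing:
        problems.append(f"missing properties: {sorted(missing)}")
    return problems


def fill_in_form(record: dict[str, Any], mapping: dict[str, str]) -> dict[str, Any]:
    """Flatten the record, then rename properties to the target's names."""
    return {mapping.get(key, key): value for key, value in flatten(record).items()}


london_record = {
    "person": {"name": "Ada Lovelace", "job": {"title": "Analyst"}},
    "address": {"city": "London"},
}

print(violations(london_record, GLOBAL_HUB_RULES))  # two problems before conversion
converted = fill_in_form(london_record, LONDON_TO_GLOBAL)
print(violations(converted, GLOBAL_HUB_RULES))      # no problems after conversion
```

The point of the design is that `violations` only ever looks at the target hub's own rules, so no hub needs to know about the rest of the network.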
Tobias Macey
0:29:34
And then to use an example, say you have a name field in a particular record. In one region, it's just a flat text field: you put in whatever characters you want, whatever types of spacing or hyphenation. But then in another hub, you have a name field that requires the first and last name, which obviously doesn't fit a number of different localities and their particular approach to how naming is handled. So how do you approach that type of challenge, where one hub is enforcing just a flat text field that supports Unicode, and the other one requires a first and last name and maybe only accepts ASCII, and being able to handle transformations? Is it a case where you can say, "I refuse to make this transformation, and so this data won't be transmitted"? And if so, how do you signal that to the person who's trying to consume that data for a particular analysis?
Tim Ward
0:30:30
Exactly. It's a really good example. So first of all, I guess the first thing would be that there would be an attempt by the hub orchestrator to say, "Okay, I'm going to give you a tool that allows you to convert between character sets." But there are some character sets that have such low fidelity that you can't upgrade them. For example, if you've used an encoding that ruins your accented characters, like the ones we have here in Denmark, then it's sometimes hard to reconstruct that using tools. So you have already answered the question, which is that there are some times where it'll just say, "No, I'm sorry, but we can't do this. I can't give you the tool to do it. You've got to manually address it yourself, but I'm just going to reject this." I mean, using my analogy again of travel, I'm sure there are many times where a person travels into a country, and when they get there, it's just, "I can't help you. There's no form you can fill out." We're just not compatible; I can't take data from you. And a good example for this would be, I would imagine, and I have experienced at least, that most large enterprises, and actually expand this out to most businesses in general, probably want a global view of their data. They want to say, "Yes, even though I'm across 20 countries, I need to do global reporting." So you could probably imagine the global hub would then say, "Okay, all hubs below me, start to send me your data." And the Australia hub will then say, "Okay, got it, I can go and help you with the looking up of that data and the transport." But the global hub's going to have some rules and policies set. A good example would be data quality scores: "I'm not going to accept anything below 90% completeness, I'm not going to accept anything below 95% accuracy. And if the data coming in doesn't meet that, I will then give those hubs the tools to be able to get to that level, and that will take time. But essentially, if I'm going to accept that data, it needs to at least meet these levels."
And you can probably imagine the same could be applied to things like data policies for governance or compliance, such as, "If you send me data and it's personally identifying, I'm just not going to take it. So I'm going to give you tools that allow you not only to identify why I'm not taking that data, but to rectify the situation as well." So that's probably a good example of a case where, many times, with that communication or that orchestration that the global headquarters is asking for, you get to a point where it just says, "Nope, I'm sorry, I can't help you. I will just downright reject that data."
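The admission rules described above (at least 90% completeness, at least 95% accuracy, no personally identifying data) can be sketched as a small policy check. This is only a sketch under assumptions: the class names, the way completeness is computed, and the sample batch are hypothetical, and the accuracy score is taken as given because the conversation does not say how it is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryPolicy:
    # Thresholds taken from the example in the conversation.
    min_completeness: float = 0.90
    min_accuracy: float = 0.95
    allow_pii: bool = False


@dataclass(frozen=True)
class Batch:
    source_hub: str
    completeness: float
    accuracy: float
    contains_pii: bool


def completeness(records: list[dict], required_fields: list[str]) -> float:
    """Share of required fields that are present and non-empty (assumed metric)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for record in records for name in required_fields
        if record.get(name) not in (None, "")
    )
    return filled / total


def admit(batch: Batch, policy: EntryPolicy) -> tuple[bool, list[str]]:
    """Return (accepted, reasons); the reasons tell the sender what to fix."""
    reasons = []
    if batch.completeness < policy.min_completeness:
        reasons.append(
            f"completeness {batch.completeness:.0%} is below {policy.min_completeness:.0%}"
        )
    if batch.accuracy < policy.min_accuracy:
        reasons.append(
            f"accuracy {batch.accuracy:.0%} is below {policy.min_accuracy:.0%}"
        )
    if batch.contains_pii and not policy.allow_pii:
        reasons.append("batch contains personally identifying data")
    return (not reasons, reasons)


records = [{"name": "A", "city": ""}, {"name": "B", "city": "Sydney"}]
score = completeness(records, ["name", "city"])  # 3 of 4 fields are filled
batch = Batch("australia", completeness=score, accuracy=0.97, contains_pii=False)
accepted, reasons = admit(batch, EntryPolicy())
print(accepted, reasons)
```

Returning the reasons alongside the decision mirrors the idea that a rejection should tell the sending hub why the data was refused, so it can rectify the situation.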
Tobias Macey
0:33:27
And in this architecture, is it primarily just a means of storing and communicating about and transmitting data? Or do the individual hubs also provide capacity for computation, where, in the instance that we were discussing, for instance the name, where there is no way to cleanly convert from one representation to the other, you can push your analysis down to that hub to perform the computation that you want, and then return the results back, in sort of a scatter-gather approach?
Tim Ward
0:33:58
Yeah, I mean, I'll be honest, I haven't even thought about that. But when you think about the architectures that we use, in CluedIn we use Kubernetes as our orchestration framework for our containers and for deployment and all the goodness that comes with that. And so it's not a hard stretch to then say, "Well, can all of those nodes just represent a node pool? And then can I use Kubernetes just to register those as node pools?" And if I need that extra resource, I could look to Kubernetes. And to be honest, Kubernetes does this for us. It says, "What pods are available? I've got someone demanding four gigs of RAM and two CPUs. Who's got it?" And you could, in a way, utilize that as a farm to distribute work out to, or at least balance the load on the infrastructure and try to get the efficiencies on not only the infrastructure, but the costs associated with it. So I'm sure you could bend this to do something like that.
Tobias Macey
0:35:05
And another question about this data hub architecture that can potentially lead to challenges in data governance and management is the question of data proliferation, where if you're copying information between the different hubs, how do you track what records are being used in what locales? And particularly in the case where you have an update to a set of information, how do you then replicate that across all the different areas where it has been copied to? Or in the event that a record is deemed false, or under GDPR, where somebody has requested deletion of their data, how do you handle removing that information from all the different places where it's been replicated to?
Tim Ward
0:36:17
Let's just start with this: it's really complex, right? It's a really complex challenge, because now we're really exploring, I wouldn't even say edge cases, but we're really pushing this architecture to its limit. So, for example, in CluedIn we have this idea of a mesh API, which essentially says, hey, if I've pulled in all the data from all the operational systems, and then someone identifies that Tobias's job title is wrong.
Well, because I know where it came from, I also have that lineage of, okay, if I did change it, what records in the source systems need to be updated? Now, the way our mesh API works is in two ways. First of all, you can automate that process behind a workflow. So you can imagine, if I've got a record from a CRM system like Salesforce, it has an ID associated with it. Great, that's been ingested into our platform. And hey, I've got another record on Tobias from our HR system, Workday. It's got an ID as well. So in the CluedIn platform, we've figured out that these are duplicates, let's merge them. So now I've got one record on Tobias, but two pointers to the source systems. So that's in one hub. Then really, to span multiple hubs, essentially what you're doing is just adding extra layers of mapping. So instead of saying, "Hey, I'm updating the job title of Tobias, and there are two source keys, one from Salesforce and one from Workday, that need to be updated," what you're doing is putting an extra layer of abstraction or indirection in there, which is, "Hey, I'm over here, called the hub orchestrator, and I actually know where all the data is located." Now, each hub won't have access to this whole journal, right? They'll only be able to look up what they've got access to, essentially, in the journal that says, "Ah, got it, so these hubs are actually where I originally got that data from." So essentially what you're doing is, when you're integrating data, you're mapping up to some type of core, what we like to call vocabulary, or schema. And then on the reverse, you're unraveling that. You're saying, "Cool, my goal across the business is that I don't really care or want to know if in Salesforce we call it the job title and if in Workday it's called job position. I want to map that up to some type of standardized vocabulary or lexicon."
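To make the mapping-and-unraveling idea above concrete, here is a minimal Python sketch. It is not the mesh API itself, which the conversation does not specify. It assumes a merged record keeps pointers to its source records, and that a vocabulary maps one standardized key to each system's own field name. The two field names come from the conversation; the hub name, record IDs, class names, and values are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical core vocabulary: standardized key -> field name per source system.
VOCABULARY = {
    "job_title": {"salesforce": "job title", "workday": "job position"},
}


@dataclass(frozen=True)
class SourceKey:
    hub: str        # hub that originally supplied the record
    system: str     # source system inside that hub
    record_id: str  # ID in the source system (made up below)


@dataclass
class MergedRecord:
    values: dict[str, str]
    sources: list[SourceKey] = field(default_factory=list)


def plan_write_back(record: MergedRecord, core_key: str, new_value: str) -> list[dict]:
    """Unravel the vocabulary mapping into one update per source record."""
    updates = []
    for source in record.sources:
        source_field = VOCABULARY[core_key].get(source.system)
        if source_field is None:
            continue  # this system never supplied the field
        updates.append({
            "hub": source.hub,
            "system": source.system,
            "id": source.record_id,
            "field": source_field,
            "value": new_value,
        })
    return updates


def plan_delete(record: MergedRecord) -> list[tuple[str, str, str]]:
    """List every (hub, system, id) that must be deleted for a removal request."""
    return [(s.hub, s.system, s.record_id) for s in record.sources]


tobias = MergedRecord(
    values={"job_title": "Engineer"},
    sources=[
        SourceKey("london", "salesforce", "SF-001"),
        SourceKey("london", "workday", "WD-042"),
    ],
)

for update in plan_write_back(tobias, "job_title", "Host"):
    print(update)
print(plan_delete(tobias))
```

Spanning several hubs only adds the `hub` component to each pointer, which is the extra layer of mapping described above.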
And this is quite normal in the data governance area. But really, when we're starting to write back or we need to delete records, so Tobias comes in and says, "Delete everything across all hubs," essentially what you're doing is unraveling that vocabulary mapping that you've done, back to the source hubs, to delete those records, and it's really hard. There are so many challenges involved. And I'll give you another good example. There's this concept of data sovereignty, which is essentially: where is the data located? Now, depending on who you talk to, they'll typically also include, from a location perspective, where that data is traveling. So obviously, the hub concept, or the network, kind of instinctively has this in its design: a hub is usually set up in the region that you're wanting to host that hub in. So in something like AWS, you would go with US East or US West; the same in Azure, for example. And what becomes complex with this is these rules of sovereignty. And I'll give you one country that's quite complex, and that's Germany. Germany is very strict on where its data can be hosted. So for example, if I was a hub in the US and I said, "Give me the data from Germany," you can't just easily or legally transfer that data to US servers. So then the mesh framework, the data hub framework, becomes really interesting. Because if we've got the backing of something like Kubernetes from an infrastructure perspective that says, "Hey, if you need any more hubs, I can spin those up. That's just an extra node pool, and your Helm charts have already given me instructions for what a node pool looks like," you can start to think of situations where you might spin up new hubs on demand and tear them down when the processing has been done. So for example, there might be a rule that says, "Yeah, you can send data to Germany, but we can't do it the other way around." Got it.
So the hub orchestrator knows that rule, or at least you map that into the hub orchestrator to say, that's the direction in which these hubs can actually talk. So if global, let's just say they're based in the US, says, "I need to report globally," it says, "Not a problem, but to crunch the German data, you need to use the German infrastructure to do that and send data from the US to the German infrastructure hub." Or, as in other examples we've seen before, maybe you have to spin up infrastructure in another sovereign country, i.e., take something like Switzerland. So instead of the US sending to Germany and Germany sending to the US, we spin up infrastructure in Switzerland, both sources send their data to Switzerland, that data is crunched, and potentially that hub is completely torn down. And as you probably already know, Tobias, this is something that Kubernetes does well: spinning up different environments and tearing them down when they're not necessary.
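The directional rule described above can be sketched as a set of allowed transfers that the hub orchestrator consults when choosing where to process data. This is a minimal Python sketch under assumptions: the hub names and the allowed pairs simply mirror the examples in the conversation, they are not statements about actual law, and the function names are hypothetical.

```python
# Hypothetical directed transfer rules held by the hub orchestrator.
# (source, destination) means data may move from source to destination.
ALLOWED_TRANSFERS = {
    ("us", "germany"),
    ("us", "switzerland"),
    ("germany", "switzerland"),
}


def can_send(source: str, destination: str, allowed=ALLOWED_TRANSFERS) -> bool:
    """Data already in a hub can stay there; otherwise the pair must be allowed."""
    return source == destination or (source, destination) in allowed


def pick_processing_hub(sources, candidates, allowed=ALLOWED_TRANSFERS):
    """Return the first candidate hub that every source may send to, else None."""
    for hub in candidates:
        if all(can_send(source, hub, allowed) for source in sources):
            return hub
    return None


# Global reporting over US and German data: the US hub is ruled out because
# Germany may not send to it, so the work lands on the German infrastructure.
print(pick_processing_hub(["us", "germany"], ["us", "germany", "switzerland"]))

# If the German hub is not a candidate, a temporary hub in Switzerland is the
# only place both sources may send to; it could be torn down afterwards.
print(pick_processing_hub(["us", "germany"], ["us", "switzerland"]))
```

A `None` result corresponds to the case where no compliant location exists and the request has to be rejected.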
Tobias Macey
0:41:40
And for somebody who's actually interested in building out their own implementation of this data hub architecture, what are some of the technologies that are useful for being able to implement things like the hub orchestrator and the data storage layer and some of these automated transformations? And what are some of the considerations or edge cases that they should be aware of as they're starting to plan out some sort of deployment like that?
Tim Ward
0:42:06
Yeah, good question. So I'll start with this: having a hub architecture in a traditional on-premise environment is quite hard, as you can imagine, especially if we're doing location-based hubs. And most enterprise businesses, even though they might be in the cloud, will still quite often have an internal VMware cluster or some VM environment set up. So one of the technology caveats would be that if you are a completely on-premise business, the data hub architecture becomes a little bit more complex to achieve. Not unachievable, but much more complex. I've already mentioned a couple of the core technologies that you would want to think about. Kubernetes, of course, and with that, containers in general. Why? Because you might want to spin up different environments, or, as in the example you gave of what you could use the data hub architecture for, "Hey, I'm a hub in London and I'm doing absolutely nothing, but Germany is processing data and it's overloaded. Give me your work." And obviously, if those sovereignty rules are set up so that we're allowed to transfer data between those hubs, then fantastic. So the thing is that if you've played with Kubernetes and containers before, you're actually a majority of the way there. All we're really doing is applying the same methodologies, the same ideas, to the hosting of data, where a container is not necessarily a product. So we're not talking about a container like Spark or Kafka. Rather, you could represent a container as a hub, where it's not just about technology, but it's also the data and the policies of how different Docker containers work with each other. So what does that bring in? Isn't that why we have Docker Compose? Docker Compose is there to say, "This is how I compose my hub network."
And then we use something like Helm to say, "Oh, and this is how all of the charts are set up for Kubernetes," so things like the amount of RAM and CPU that we allocate to the different nodes in the cluster or the pods. So then you really start to get into, "Ah, if I can understand that, it's really very similar to the data hub architecture itself."
Tobias Macey
And then the other question of the hubs and how they relate to each other is whether you find it useful or necessary for each of the different hubs to be implemented in a homogeneous fashion, where everything is using the same set of technologies, or, because of the abstraction layer of the hub orchestrator, is it possible to have more of a heterogeneous implementation where each of the different hubs can use the technologies that are most well suited for the particular teams who are maintaining them?
Tim Ward
0:45:09
Being completely transparent, I have not even humored the thought. But when you think about it, if I use the technologies that I was talking about in the previous questions, you could argue that if your application is built well and the proper abstraction layers have been put in, then at any point, if I want to switch out SQL Server for MySQL, I should be able to do that, which kind of plays into the role of hubs. So if I've got a hub that's set up with my own custom implementation, all I'm really doing is using the hub architecture as interfaces, or as some type of schema or agreement, where I'm saying, "Listen, to play in this network, to play in this hub, you need to fill in the following things: you need to tell me rules on data quality, you need to give me policies." And you can imagine that the hub orchestrator that we use is very technology agnostic. It's essentially YAML files that have policies. So there's no vendor-specific or technology binding besides that; I guess you could say it's a ubiquitous language, or at least something that's not bound to a particular database type or a particular programming language. So then it makes it a little bit easier to grasp the fact that these hubs don't need to be the same technology. Rather, they just need to adhere to what the hub orchestrator is telling them to adhere to, so that it can do its job and tell these different systems how to do their job. Now, here's where it gets a little bit more complex: it's also the role of the hub orchestrator to say, "I'm going to give you the tooling so that we can talk," i.e., your data is not complete enough, your data is not accurate enough, and so, you know, good luck with that. So then you would pose the question: okay, is it the job of the hubs to have those tools?
Or is that the job of the hub orchestrator to provide kind of ubiquitous tools that also aren't necessarily technology specific. And I find that hard potentially to achieve? Because I would argue that if you're not binding yourself, at least to some consistent tooling around hubs talking to each other, you run the risk of saying there's 20 different ways to solve this challenge. And every time I want to talk to a new hub, I'm given a new tool to do it.
So it feels like one of those things that needs to have more discovery around it, and maybe usually the hub orchestrator is responsible for giving that consistent tooling so hubs can chat.
Tobias Macey
0:47:51
And when is the data hub approach the wrong idea? When would it be easier to use a different style overall, whether it's a monolithic data lake or data warehouse or some of the more traditional approaches to data management that we've been using for the past several years?
Tim Ward
0:48:24
So, you know, the funny thing is, we work with a lot of customers where, even though they're large, and that can be measured in multiple ways, it could be revenue or employee count, but in this case I'm talking about employee count, when they talk to us about their business, one of the first things they'll say is, "Tim, we're not actually a complex business, right? For example, we are in facility services, where we clean people's offices, or we do the catering at people's offices. So inherently, we might not necessarily be a complex business like shipping, which requires lots of logistics and good timing and schedules and things like that." And so I would definitely argue that with any win that you take from engineering, you always take some losses. So really, if you find that your business is not complex, something like this is overkill, over-engineering. And of course, as in the example I used before, I wouldn't necessarily argue that the goal of business is to become more complex. It's definitely the goal of businesses to expand and grow and hopefully go into other countries, and there's inherently some complexity that comes with that. But at heart, if you find that you're not a complex business, that there aren't hundreds and hundreds of rules that change per region and maybe change in different departments, something like this seems like it would be absolute overkill.
Tobias Macey
0:49:57
And are there any other aspects of the data hub architecture, or some of the ways that it's being used or the benefits that it provides, that we didn't discuss yet that you think we should cover before we close out the show?
Tim Ward
0:50:07
I think the one thing I'd like to reiterate is that if you were to Google "data hub", there are multiple interpretations of it. And I think the reason why is that it's quite a broad term; you can apply it to multiple different things. And if you see our interpretation, you could have easily called it a data network, or a data mesh, or a data service mesh. And I would back that up; I would say, yeah, those are other possible ways of interpreting what a data hub is for. So I would say, when you're doing your exploration, just be wary that people are interpreting this quite differently.
Tobias Macey
0:50:46
And I believe that there's also a recent project out of LinkedIn called DataHub that they're using for their data discovery engine as well, just to muddy the waters a bit further.
Tim Ward
0:50:55
Yeah, exactly. And it makes sense, doesn't it? Like, the hub, that's where I go to discover what the data is. And so yeah, I agree. It's maybe a term where in two or three years we'll all congregate on one interpretation of it, and I think that would help clear up the situation.
Tobias Macey
0:51:14
Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today.
Tim Ward
0:51:29
So I think education. Education is a big thing, and here's why I say it. I see a lot of businesses taking the traditional approaches to data. It makes sense, right? "It's what worked at my last workplace, so let's just do that same thing again, maybe with a different vendor, maybe with a little bit more modern technology." One of the things I think is really missing is education on the data management space itself. Of course, you've got great services like Udemy and Pluralsight that are out there to train. But I actually think we don't have enough education that's cracked this concept of, well, how do you solve some of the real complexities like we talked about today? We often see things like, "Hey, here's how you build a pipeline. Here's how you migrate data from source to target." But what you don't really see are the deep dives into, "Yeah, let's go after the big challenges." Because quite often, what we're hearing from our customers is, "I'm okay with solving the challenges of data if it's two systems, but I mean, that was me 50 years ago. I've got 200 systems. I've got 300 systems." So education around how to solve those large data management challenges is where I think there's a huge gap.
Tobias Macey
0:52:48
Well, thank you very much for taking the time today to join me and share your thoughts on the approach that you're using to enable data management across global businesses and handle these issues of regional compliance and the challenges of data sovereignty and things like that. It's definitely a big problem, as you put it, and something that is worth exploring a bit deeper. So I appreciate all of your time and efforts on that front, and I hope you enjoy the rest of your day.
Tim Ward
0:53:14
Perfect, thanks, Tobias.
Tobias Macey
0:53:21
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.