Data Lakes

A High Performance Platform For The Full Big Data Lifecycle - Episode 94

Summary

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
    • What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
  • Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
  • Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
  • For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
    • How does HPCC Systems manage persistence and scalability?
  • What are the integration points available for extending and enhancing the HPCC Systems platform?
  • What is involved in deploying and managing a production installation of HPCC Systems?
  • The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
  • How does the Thor engine manage data transformation and manipulation?
    • What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
  • For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
  • How are you using the HPCC Systems platform in your work at LexisNexis?
  • Despite being older than the Hadoop platform, it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
  • How is the HPCC Systems project governed, and what is your approach to sustainability?
    • What are some of the additional capabilities that are only available in the enterprise distribution?
  • When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
  • What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
  • What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
  • What do you have planned for the future of HPCC Systems?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:14
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI conference, the Strata Data conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Your host is Tobias Macey, and today I'm interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions. So Flavio, can you start by introducing yourself?
Flavio Villanustre
0:01:36
Of course, Tobias. So my name is Flavio Villanustre. I'm Vice President of Technology and CISO for LexisNexis Risk Solutions. At LexisNexis Risk Solutions we have a data platform called the HPCC Systems platform. We made it open source in 2011, and since then I've also been, as part of my role, involved with leading the open source community initiative: ensuring that the open source community truly leverages the platform and helps contribute to the platform, and certainly acting as a liaison between the LexisNexis Risk Solutions organization and the rest of the larger open source community.
Tobias Macey
0:02:19
And do you remember how you first got involved in the area of data management?
Flavio Villanustre
0:02:22
Truly, it has been seamless. It probably started in the early '90s, going from databases to database analytics to data management to data integration. Keep in mind that even within LexisNexis, we started the HPCC Systems platform back before the year 2000, so back then we already had data management challenges with traditional platforms, and we started with this. I've been involved with HPCC since I joined the company in 2002, but I had been in data management for a very long time before that.
Tobias Macey
0:02:57
And so for the HPCC system itself, can you talk through some of the problems that it was designed to solve and some of the issues that you were facing at LexisNexis that led to its original creation?
Flavio Villanustre
0:03:10
Oh, absolutely. So at LexisNexis Risk Solutions, we started with risk management as, I'd say, our core competency back in the mid '90s. And as you get into risk management, one of the core assets when you are trying to assess risk and predict outcomes is data. Even before people spoke about big data, we had a significant amount of data, mostly structured, some semi-structured too, but the vast majority structured. And we used to use the traditional platforms out there, whatever we could get our hands on. And again, this is back in the day before Hadoop, and before MapReduce was applied as a distributed paradigm for data management or anything like that. So databases: Sybase, Oracle, Microsoft SQL, whatever was available; and data management platforms like Ab Initio and Informatica, whatever was available at the time. And certainly the biggest problem we had was twofold. One was scalability: all of those solutions typically run on a single system, so there is a limit to how much bigger you can go vertically, and certainly if you're also trying to consider the cost affordability of the system, that limit is much lower as well, right? There is a point where you go beyond what a commodity system is and you start paying a premium price for whatever it is. So that was the first piece. One of the attempts at solving this problem was to split the data and use different systems, but splitting the data also creates challenges around data integration. If you're trying to link data, surely you can take the traditional approach, which is to segment your data into tables, put those tables in different databases, and then use some sort of foreign key to join the data. But that's all good and dandy as long as you have a foreign key that is unique and reliable, and that's not the case with data that you acquire from the outside. If you generate the data yourself, you can have that; if you bring the data from the outside, you might have a record that says this record is about John Smith, and you might have another record that says this record is about Mr. John Smith, but do you know for sure that those two records are about the same John Smith? That's a linking problem, and the only way that you can do linking effectively is to put all the data together. So now we had this particular issue where, in order to scale, we needed to segment the data, but in order to be able to do what we needed to do, we needed to put the data in the same data lake, as it's known today. We used to call it a "data land"; eventually we ditched that term in the late 2000s because "data lake" became more well known. So at that point, the potential paths to overcome the challenge were: either we split all of the data as we were doing before, and then come up with some sort of meta-system that will leverage all of these discrete data stores, and potentially, when you're doing probabilistic linkage, you have problems that have a computational complexity of n squared or worse, so that means we would pay a significant price in performance, but potentially it can be done if you have enough time, your systems are big enough, and you have enough bandwidth between the systems. But the complexity you're taking on from a programming standpoint is also quite significant. And sometimes you don't have enough time: sometimes you get data updates that are maybe hourly or daily, and doing this big linking process across different systems may take you weeks or months. The complexity of programming this is also a pretty significant factor to consider. So at that point, we thought that maybe a better approach was to create an underlying platform that would apply this type of algorithm in a divide and conquer approach: we would have something that would partition the data automatically and distribute those partitions onto different commodity computers, and then we would add an abstraction layer on top of it that would create a programming interface that gave you the appearance that you are dealing with a single system with a single data store, and whatever you coded for that data store would be automatically distributed to the underlying partitions. Also, because the hardware was far slower than it is today, we thought that a good idea would be to move as much of the algorithm as we could to those partitions rather than executing it centrally. So instead of bringing all of the data to a single place to process it, which that single place might not have enough capacity for, we would do as much as we can, for example a pre-filtering operation, a distributed grouping operation, or a filtering operation, across each one of the partitions. And eventually, once you need to do the global aggregation, you can do it centrally, but now with a far smaller data set that is already pre-filtered. Then the time came to define how to build the abstraction layer. The one thing that we knew about was SQL as a programming language, and we said, well, this must be something that we can tackle with SQL as a programming interface for our data analysts. But the people who worked with us were quite used to a dataflow model because of the type of tools they were using before, things like Ab Initio, where the data flows are diagrams in which the nodes are the operations, the activities you do, and the lines connecting the activities represent the data traversing them. So we thought that a better approach than SQL would be to create a language that gave you the ability to build these sorts of data flows in the system. That's how ECL was born, which is the language that runs on HPCC.
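For illustration, here is a minimal ECL sketch of the divide-and-conquer idea described above: do as much work as possible on each node's partition with a LOCAL operation, then aggregate the much smaller per-node results globally. The record layout and file name are hypothetical, not taken from the episode.

    // Hypothetical record layout and logical file name, for illustration only.
    PersonRec := RECORD
      STRING30 name;
      STRING2  state;
    END;

    people := DATASET('~example::people', PersonRec, THOR);

    // Per-partition counts: LOCAL keeps the work on each node's own slice of the data.
    localCounts := TABLE(people,
                         {people.state, UNSIGNED8 cnt := COUNT(GROUP)},
                         people.state, LOCAL);

    // Global rollup over the small per-node summaries.
    globalCounts := TABLE(localCounts,
                          {localCounts.state, UNSIGNED8 total := SUM(GROUP, localCounts.cnt)},
                          localCounts.state);

    OUTPUT(globalCounts);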
Tobias Macey
0:09:05
So it's interesting that you had all of these very forward looking ideas in terms of how to approach data management well in advance of when the overall industry started to encounter the same types of problems, as far as the size and scope of the data that they were dealing with, that led to the rise of the Hadoop ecosystem and the overall ideas around data lakes and MapReduce and some of the new data management paradigms that have come up. And I'm wondering what the overall landscape looked like in the early days of building the HPCC system that required you to implement this in house, and some of the other systems or ideas that you drew on for inspiration for some of these approaches toward data management and the overall systems architecture for HPCC.
Flavio Villanustre
0:09:52
That is a great question. It's interesting, because in the early days, when we told people what we were doing, they would look at us puzzled and often ask, well, why don't you use database X, Y, Z, or data management system X, Y, Z? And the reality is that none of those would be able to cope with the type and amount of data we needed to process; they wouldn't offer the flexibility of processing, like the probabilistic record linkage that I explained before; and they certainly didn't offer a seamless transition between data management and data querying, which was also one of the important requirements that we had at the time. It was quite difficult to explain to others why we were doing this and what we were gaining by doing it. Map and reduce operations, as functional programming operations, have been around for a very long time, since the Lisp days in the '50s, but the idea of using map and reduce as operations for data management didn't get published until, I believe, December 2004. I remember reading the original paper from the Google researchers thinking, well, now someone else has the same problem, and they got to do something about it. But at the time we already had HPCC and we already had ECL, so it was perhaps too late to go back and try to re-implement the data management aspects and the programming layer abstraction on HPCC. Just for those people in the audience who don't know much about ECL: again, this is all open source and free, Apache 2.0 licensed, and there are no strings attached, so please go there and look at it. In summary, ECL is a declarative dataflow programming language. Declarative in the manner of what you can find in SQL, or in functional programming languages, Haskell maybe, or Lisp and Clojure, and others out there. From the dataflow standpoint it is closer to something like TensorFlow, if you are familiar with TensorFlow as a deep learning programming paradigm and framework. You code data operations that are primitives, like, for example, sort: you can say sort this data set by this column in this order, and then add more modifiers if you want; you can do a join across data sets, and again the operation is named JOIN; and you can do a rollup operation, and the operation is named ROLLUP. All of these are high level operations, and you define them in your program. In a declarative programming language, you create definitions rather than assign variables. For those who are not familiar with declarative programming, and surely many in this audience are, declarative programming has, for the most part, the property of having immutable data structures, which doesn't mean that you cannot do valuable work; you can do all of the work the same way or better, but it gets rid of side effects and other pretty bad issues that come with the more traditional mutable data structures. So you define things: I have a data set that holds a phone book, and I want to define an attribute that is this data set filtered by a particular value, and then I might define another attribute that uses the filtered data set to group it in a particular way. At the end of the day, any ECL program is just a set of definitions that are compiled by the compiler.
The ECL compiler compiles ECL into, in reality, C++, which then goes into the C++ compiler of the system, GCC or Clang or whatever you have, and generates the machine code that is run on the platform. But the fact that ECL is such a high level programming language, and the fact that it is declarative, means that the ECL compiler can take decisions that a more imperative type of programming language wouldn't allow a compiler to take. The compiler in a declarative programming language, and in functional languages this is also the case, knows the ultimate goal of the program, because the program is, in some ways, isomorphic to an equation: you could even inline, from a functional standpoint, every one of your statements into a single massive statement, which of course you cannot do from a procedural standpoint. So the compiler can now, for example, apply non-strictness: if you made a definition that is never going to be used, there is no point for that definition to be even compiled in or executed at all, and that saves performance. If you have a conditional fork in a place in your code, but that condition is always met or never met, then there is no need to compile the other branch. All of these give you performance implications that can be far more significant when you're dealing with big data. One particular optimization is around data access: it is far more efficient, and a lot faster, if you are going to apply similar operations to every record in the data set, to combine all of those operations and do only one pass over the data with all the operations, if possible. And the ECL compiler does exactly that. It takes away, perhaps, a little bit of flexibility from the programmer by being far more intelligent at the moment the code is compiled. Of course, programmers can tell the compiler "I know better" and force it to do something that might otherwise seem unreasonable. But just as an example: you could say, well, I want to sort this data set, and then I want to filter it and get only these few records. If you say that in that order in an imperative programming language, it would first sort, and sort, even in the most optimal case, is an n log n type of operation in computational complexity, and then filter it and get only a few records out of it, when the optimal approach would be to filter first, get those few records, and then sort only those. And the ECL compiler does exactly that.
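For readers unfamiliar with ECL, the following is a minimal sketch of the declarative style described here: everything is a definition, and only the final action triggers evaluation, so the compiler is free to push the filter ahead of the sort. The record layout and file name are hypothetical.

    // Hypothetical phone book layout and file name, for illustration only.
    PhoneRec := RECORD
      STRING30 lastName;
      STRING30 firstName;
      STRING15 phone;
      STRING2  state;
    END;

    phoneBook := DATASET('~example::phonebook', PhoneRec, THOR);

    // Definitions, not statements: nothing executes yet.
    sorted := SORT(phoneBook, lastName, firstName);
    justNY := sorted(state = 'NY');   // filter expressed against the sorted definition

    // The action forces evaluation; because definitions have no side effects,
    // the compiler can filter first and then sort only the surviving records.
    OUTPUT(justNY);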
Tobias Macey
0:16:01
The fact that the language that you're using for defining these manipulations ends up being compiled, and I know that it's implemented in C and C++, both the ECL compiler itself as well as the overall HPCC platform, is definitely a great opportunity for better performance characteristics, and I know that in the comparisons that you have available for going between HPCC and Hadoop, that's one of the things that gets called out. As far as the overall workflow for somebody who is interacting with the system using the ECL language, I'm curious if the compilation step ends up being in any way, not a hindrance, but a delaying factor as far as being able to do some experimental iteration, or if there is the capability of doing some level of interactive analysis or interaction with the data for being able to determine what is the appropriate set of statements to get the desired end result when you're building an ECL dataflow.
Flavio Villanustre
0:17:05
Nice, another great question. I can see that you are quite versed in programming. So you're right, the fact that ECL is compiled means that, just again for the rest of the audience, we have an integrated development environment called ECL IDE, and of course we support others like Eclipse and Visual Studio and all of the standard ones, but I'll just talk about ECL IDE because it's what I mostly use. In that case, you write the ECL code, and then you can certainly run a syntax check to verify that the code is syntactically correct. But at some point you want to run the code, because you want to know if it semantically makes sense and will give you the right results, right? And running the code goes through the compilation process; depending on how large your code base is, the compilation process can certainly take longer. Now, the compiler does know what has been modified. Remember, again, ECL is a declarative programming language, so if you haven't touched a number of attributes, and again data structures are immutable, the attributes you didn't change, since there are no side effects, should compile to exactly the same thing. When you define a function, that function has referential transparency, which means that if you call the function at any time, it will give you the same result based just on the parameters you pass; with that, the compiler can take some shortcuts. If you are recompiling a bunch of ECL attributes but you haven't changed many of them, it will just use the precompiled code for those and only compile the ones you have changed. So the compilation process, when you are iteratively working on code, tends to be fairly quick, maybe a few seconds, although of course you depend on having an ECL compiler available. Traditionally we used to have a centralized approach to the ECL compiler, where there would be one or a few of them running in the system. We have moved to a more distributed model where, when you deploy your ECL IDE and ECL tools on your workstation, a compiler goes with them, so the compilation process can happen on the workstation as well, and that gives you the ability to have it available at all times when you're trying to use it. One of the bottlenecks before was that, when you were trying to take this quick iterative approach, the compiler might be occupied by someone compiling a massive amount of ECL for some completely new job, which may have taken minutes, and you were sitting there waiting for the compiler to finish that one compilation. By the way, the time to compile is an extremely important consideration, and we continue to improve the compiler to make it faster; we have learned, as you can imagine, quite a bit over the years. Some of the same core developers have been developing the ECL compiler all along. Gavin Halliday, for example, has been with us since the very beginning; he was one of the core architects behind the initial design of the platform, and he's still the lead architect developing the ECL compiler, which means that a lot of the knowledge that has gone into the compilation process and its optimization keeps making it better and better. Of course, now with a larger community working on the compiler, more people involved, and more documentation around it, others can pick up where he leaves off.
But hopefully he will be around and doing this for a long time. Making sure that the compiler is as just-in-time as it can be is very important. There is, at this point, no interpreter for ECL, and I think it would be quite difficult to make it completely interactive, to the point where you submit just a line of code and it does something, because of the way a declarative programming paradigm works, right?
Tobias Macey
0:21:17
And also, because of the fact that you're working, most likely, with large volumes of data distributed across multiple nodes, being able to do REPL-driven development is not really very practical, or it doesn't really make a lot of sense. But the fact that there is this fast compilation step and the ability to have near real time interactivity, as far as seeing what the output of your program is, is good to see, particularly in the big data landscape, where I know that the overall MapReduce paradigm was plagued in the early years by the fact that it was such a time consuming process to submit a job and then see what the output was before you could take that feedback and wrap it into your next attempt. And that's why there have been so many different layers added on top of the Hadoop platform, in the form of Pig and Hive and various SQL interfaces, to be able to get a more real time, interactive, and iterative development cycle built in.
Flavio Villanustre
0:22:14
Yeah, you're absolutely right there. Now, one thing that I haven't told the audience yet is what the platform looks like, and I think we are getting to the point where it's quite important to explain that. There are two main components in the HPCC Systems platform. There is one component that does data integration, this massive data management engine, the equivalent of your data lake management system, which is called Thor. Thor is meant to run one ECL work unit at a time, and a work unit can consist of a large number of operations, many of them running in parallel, of course. There is another one, known as Roxie, which is the data delivery engine, and there is one which is a sort of hybrid, called hThor. Roxie and hThor are both designed to run tens of thousands or more operations at the same time, simultaneously; Thor is meant to do one work unit at a time. So when you are developing on Thor, even though your compilation process might be quick, and you might run on a small data set quickly, because you can execute the work unit on those small data sets using, for example, hThor, if you are trying to do a large data transformation of a large data set on your Thor system, you still need to go into the queue in Thor, and you will get your time whenever it's due for you, right? Certainly we have priorities, so you can jump into a higher priority queue, and maybe you can be queued after just the current job but before any other future jobs. We also partition jobs into smaller units, and those smaller units are fairly independent from each other, so we could even interleave some of your jobs in between a job that is running, by getting in between each of those segments of the work unit. But interactivity there is a little bit less than optimal. It is the nature of the beast, because you want a large system to be able to process all of the data in a relatively fast manner, and if we were trying to truly multi-process there, most likely the available resources would suffer, so you might end up paying a significant overhead across all of the processes running in parallel. Now, I did say that Thor runs only one work unit at a time, but that is a little bit of a lie; that was true a few years ago. Today you can define multiple queues on a Thor, and you can run three or four work units, but certainly not thousands of them. That's a big difference between Thor and Roxie. Can you run your work unit on Roxie? Yes, or on hThor, and it will run concurrently with anything else that is running, with almost no limit; thousands and thousands of them can run at the same time. But there are other considerations around when you run things on Roxie or hThor versus on Thor, so it might not be what you really want.
Tobias Macey
0:25:29
Taking that a bit further, can you talk through a standard workflow for somebody who has some data problem that they're trying to solve, and the overall lifecycle of the information as it starts from the source system, gets loaded into the storage layer for the HPCC platform, they define an ECL job that then gets executed in Thor or hThor, and then they're able to query it out the other end from Roxie, and just the overall systems that get interacted with along that data lifecycle?
Flavio Villanustre
0:26:01
I'd love to. Very well, let's set up something very simple as an example. You have a number of data sets that are coming from the outside, and you need to load those data sets into HPCC. So the first operation that happens is something that is known as spray. Spray is a simple process; the name comes from the concept of spray painting the data across the cluster, right? This runs on a Windows box or a Linux box, and it will take the data set. Let's say that your data set is, just to give a number, a million records long. It can be in any format: CSV, fixed length, delimited, or whatever. It will look at your total data set, and it will look at the size of the Thor cluster where the data will be stored initially for processing. Let's say that you have a million records in your data set and you have ten nodes in your Thor; let's just make them round numbers, and small numbers. So it will partition the data set into ten partitions, because you have ten nodes, and it will then copy, transfer, each one of those partitions to the corresponding Thor node. This is parallelized if it can be: for example, if your records are fixed length, it will automatically use pointers and parallelize this; if the data is in XML format, or in a delimited format where it's very hard to find the partition points, it will need to do a pass over the data, find the partition points, and eventually do the parallel copying to the Thor system. So now you will end up with ten partitions of the data, with the data in no particular order other than the order you had before, right? The first 100,000 records will go to the first node, the second 100,000 records will go to the second node, and so on and so forth until you get to the end of the data set. This puts a similar number of records on each one of the nodes, which tends to be a good thing for most processes. Once the data is sprayed, or while the data is being sprayed, depending on the size of the data, or even before, you will most likely need to write the work unit you need to run on the data. I'm trying to do this example as if it's the first time you see that data; otherwise, all of this is automated, right? You don't need to do anything manually; all of this is scheduled and automated, and a work unit that you already had will run on the new data set and append it, or whatever needs to be done. But let's imagine that it is completely new. So now you write your work unit. Let's say that your data set was a phone book, and you want to first of all dedupe it and build some rolled up views on the phone book, and eventually you want to allow users to run some queries on a web interface to look up people in the phone book. And, just for the sake of argument, let's say that you're also trying to join that phone book with your customer contact information. So you will write the work unit that will have that join to merge those two, you will have some deduplication, and perhaps some sorting. And after you have that, you will want to build some keys; you don't need to, but you will want to. There is, again, a key build process that also runs on Thor and will be part of your work unit. So essentially the ECL developer, working with ECL, submits their work unit; the ECL will be compiled and will run on your data, hopefully syntactically correct when you submit it, and it will run giving you the results that you were expecting on the data. ECL, I mentioned this before, is a strongly typed language as well, which means that it is a little bit harder to get errors that only appear at runtime: between the fact that it has no side effects and the fact that it is statically typed, most typing errors, type errors, and errors in the data fed into operations are a lot less frequent. It is not like Python, where things may seem okay and the run may be fine, but then at some point one run will give you some weird error because a variable that was supposed to have a piece of text has a number instead, or vice versa. So you run the work unit and it will give you the results; as a result of this work unit it will potentially give you some statistics on the data, some metrics, and it will give you a number of keys. Those keys will also be partitioned on Thor, so if there are ten nodes, the keys will be partitioned into ten pieces on those nodes, and you will be able to query those keys from Thor as well; you can write a few attributes that do the querying there. But at some point you will want to write those queries for Roxie to use, and you will want to put the data in Roxie, because you don't have one user querying the data; you will have a million users going to query that data, and perhaps ten thousand of them will be querying at the same time. So for that process, you write another piece of ECL, another sort of work unit, but we call this a query, and you submit it to Roxie instead of Thor. There is a slightly different way to submit it to Roxie: you select Roxie and you submit it. The difference between this query and the work unit that you have in Thor is that the query is parameterized, similar to a parameterized stored procedure in your database: you will define some variables that are supposed to come from the front end, from the input from the user, and then you just use the values in those variables to run whatever filters or aggregations you need to do there. That will run in Roxie and will leverage the keys that you built in Thor. As I said before, the keys are not mandatory; Roxie can perfectly well work without keys, and it even has a way to work with in-memory distributed data sets as well, so even if you don't have a key, you don't pay a significant price in the lookup by doing a sequential scan of the data, the full table scan of your database. So you submit that query to Roxie. When you submit that query, Roxie will realize that it doesn't have the data, that it's in Thor. This is also your choice, but most likely you will just tell Roxie to load the data from Thor. It will know where to load the data from, because it knows what the keys are and what the names of those keys are, and it will automatically load those keys. It's also your choice whether to tell Roxie to start allowing users to query the front end interface while it's loading the data, or to wait until the data is loaded before it allows queries to happen. The moment you submit the query to Roxie, it is automatically exposed on the front end. There is a component called ESP; that component exposes a web services interface, and it gives you a RESTful interface, a SOAP interface, JSON for the payload if you're going through the RESTful interface, even an ODBC interface if you want, so you can even have SQL on the front end. The moment you submit the query, it automatically generates all of these web service interfaces.
So automatically, if you want to go with a web browser on the front end, or if you have an application that can use, I don't know, a RESTful interface over HTTP or HTTPS, you can use that, and it will automatically have access to that Roxie query that you submitted. Of course, a single Roxie might have not one query but a thousand different queries active at the same time, all of them exposing an interface, and it can have several versions of the same queries as well. The queries are all exposed versioned on the front end, so you know what the users are accessing, and if you are deploying a new version of a query, or modifying an existing one, you don't break your users if you don't want to; you give them the ability to migrate to the new version as they want. And that's it, that's pretty much the process. Now, I said you need to have automation, and all of this can be fully automated in ECL. You may want to have data updates, and I told you data is immutable, so every time you think you're mutating data, updating data, you're really creating a new data set, which is good because it gives you full provenance: you can go back to an earlier version. Of course, at some point you need to delete data or you will run out of space, and that can also be automated. And if you have updates on your data, we have concepts like superfiles, where you can apply updates, which are essentially new overlays on the existing data, and the existing work unit can just work on that, happily, as if it were a single data set. So a lot of these complexities that would otherwise be exposed to the user, to the developer, are all abstracted out by the system. The developers, if they don't want to see the underlying complexity, don't need to; if they do, they have the ability to. I mentioned before that ECL will optimize things, so if you tell it to do a join, it knows that before doing the join it has to do a sort. But if you know that your data is already sorted, you might say, well, let's not do that sort, or, I want to do this join on each one of the partitions locally instead of a global join; and there is the same thing with the sort, a local sort operation in ECL. Of course, if you tell it to do that, and you know better than the system, ECL will follow your orders; if not, it will take the safe approach to your operation, even if it means a little bit more overhead.
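As a rough illustration of the workflow just described, here is a hedged ECL sketch in two parts: a Thor work unit that joins the sprayed phone book with customer contacts, dedupes the result, and builds an index, followed by the general shape of a parameterized Roxie query that reads that index through a STORED value. All file, field, and attribute names are hypothetical.

    // ---- Part 1: Thor work unit (hypothetical layouts and file names) ----
    PhoneRec   := RECORD
      STRING30 name;
      STRING15 phone;
    END;
    ContactRec := RECORD
      STRING30 name;
      STRING40 email;
    END;

    phoneBook := DATASET('~example::phonebook', PhoneRec, THOR);
    contacts  := DATASET('~example::contacts', ContactRec, THOR);

    CombinedRec := RECORD
      STRING30 name;
      STRING15 phone;
      STRING40 email;
    END;

    CombinedRec JoinThem(PhoneRec l, ContactRec r) := TRANSFORM
      SELF.name  := l.name;
      SELF.phone := l.phone;
      SELF.email := r.email;
    END;

    combined := JOIN(phoneBook, contacts, LEFT.name = RIGHT.name,
                     JoinThem(LEFT, RIGHT));
    deduped  := DEDUP(SORT(combined, name, phone), name, phone);

    // Key build: keyed field plus payload fields, written to a logical key file.
    nameKey := INDEX(deduped, {name}, {phone, email}, '~key::phonebook::name');
    BUILD(nameKey, OVERWRITE);

    // ---- Part 2: Roxie query (published separately) ----
    STRING30 searchName := '' : STORED('name');   // filled in through ESP
    OUTPUT(nameKey(KEYED(name = searchName)));    // keyed lookup on the index

In practice the two parts are submitted separately, the first to a Thor queue and the second published to a Roxie cluster, but both are written in the same ECL.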
Tobias Macey
0:35:47
A couple of things that I'm curious about out of this are the storage layer of the HPCC platform and some of the ways that you manage redundancy and durability of the data. I also noticed when looking through the documentation that there is some support for being able to take backups of the information, which I know is something that is non trivial when dealing with large volumes. And also, on the Roxie side, I know that it maintains an index of the data, and I'm curious how that index is represented and maintained and kept up to date in the overall lifecycle of the platform.
Flavio Villanustre
0:36:24
Those are also very good questions. So in the case of Thor, there is a built-in redundancy concept; we need to go down into a little bit of the system architecture here. In Thor, each one of the nodes primarily handles its own chunk of data, its own partition, but there is always a buddy node, some other node that has its own partition but also keeps a copy of the partition of another node. If you have ten nodes in your cluster, node number one might have the first partition and also a copy of the partition that node ten has; node number two might have partition number two, but also a copy of the partition that node number one has, and so on and so forth. Every node has one primary partition and one backup partition from another node. Every time you run a work unit, and I said that data is immutable, you are generating a new data set every time you materialize data on the system, either by forcing it to materialize or by letting the system materialize the data when it's necessary. The system tries to stream as much as possible, in a way more similar to Spark or TensorFlow, where the data can be streamed from activity to activity without being materialized; at some point it decides that it's time to materialize, because the next operation might require materialized data, or because you've been going for too long with data that, if something goes wrong with the system, would have to be recomputed. Every time it materializes data, a lazy copy of the newly materialized data happens to those buddy nodes. So there could be a point where, if something goes very wrong and one of the nodes dies and the data on the disk is corrupted, you know that you always have another node that has a copy, and the moment you replace the dead node, essentially pull it out and put another one in, the system will automatically rebuild that missing partition, because it has complete redundancy of all of the data partitions across the different nodes. In the case of Thor, that tends to be sufficient. There is, of course, the ability to do backups, and you can back up all of these partitions, which are just files in the Linux file system, so you can even back them up using any Linux backup utility, or you can use HPCC to back them up for you into any other system; you can have cold storage. One of the problems is what happens if your data center is compromised and someone has modified or destroyed the data in the live system, so you may want to have some sort of offline backup, and you can handle all of this in the normal system backup configuration, or you can do it through HPCC and make it offloaded as well. But for Roxie, the redundancy is even more critical. In the case of Thor, when a node dies, it is sometimes less convenient to let the system keep working in a degraded way, because the system is typically as fast as the slowest node. If all nodes are doing the same amount of work, a process that takes an hour will take an hour; but if you happen to have one node die, there is now one node doing twice the work, because it has to deal with two partitions of data, its own and the backup of the other one, and the process may take two hours. So it is more convenient to just stop the process when something like that happens to a node, let the system rebuild that node quickly, and continue the processing. That might take an hour and twenty minutes, or an hour and ten minutes, rather than the two hours that it otherwise would have taken. And besides, if the system continues to run and your storage system died in one node because it's old, there is a chance that the other storage systems, which are under the same stress, will die the same way; you want to replace that one quickly and have a copy as soon as you can, and not run the risk that you lose two of the partitions. If you lose two partitions that are in different nodes and are not the backup of each other, that's fine; but if you lose the primary node and the backup node for the same partition, there is a chance that you may end up losing the entire partition, which is bad. Again, bad if you don't have a backup, and even restoring from a backup takes time, so it's also inconvenient. Now, in the Roxie case, you have far larger pressure to have the process continue, because your Roxie system is typically exposed to online production customers that may pay you a lot of money for you to be highly available.
So Roxie allows you to define the amount of redundancy that you want, based on the number of copies that you want. You could say, well, I'm fine with the default, one copy of the data, or I need three copies of the data. So maybe the partition on node one will have a copy on nodes two, three, and four, and so on and so forth. Of course, you need four times the space, but you have far higher resilience if something goes very wrong, and Roxie will continue to work even if a node is down, or two nodes are down, or as many nodes as you want are down, as long as the data is still available. Worst case scenario, even if it lost a partition completely, Roxie might, if you want, continue to run, but it won't be able to answer any queries that were trying to leverage that particular partition that is gone, which is sometimes not a good situation. You asked about the format of the keys: the format of the indexes in Roxie is interesting. For the most part you will have a primary key, and these are all keys that are multi-field, like in any normal database out there. They have multiple fields, and typically those fields are ordered by cardinality, so the fields with the larger cardinality will be at the front to make it perform better. It has interesting abilities: for example, you can step over a field that you don't have a value for, you have a wildcard for it, and still use the remaining fields, which is not something that a database normally does; once you hit a field that you don't have a value to apply, the rest of the fields on the right hand side are useless. There are other things that are quite interesting there as well. The way the data is stored in those keys is by decomposing the keys into two components: there is a top level component that indicates which node has that partition, and there is a bottom level component, which indicates where in the hard drive of the node the specific data elements, or the specific block of data elements, are. By decomposing the keys into these two hierarchical levels, every node in Roxie can hold the top level, which is very small, so every node knows where to go for the specific values. Every node can be queried from the front end, so you now have good scalability on the front end; you can have a load balancer and load balance across all of the nodes, and still, on the back end, they know which node to ask for the data. When I said that the top level indicates the specific node, I lied a little bit, because it's not node number one; it uses multicast. Nodes, when they have a partition of the data, subscribe to a multicast channel, and what you have in the top level is the number of the multicast channel that handles that partition. That allows us to make Roxie nodes more dynamic and also handle the fault tolerance situations where nodes go down. It doesn't matter: you send the message to a multicast channel, and any node that is subscribed will get the message. Which one responds? Well, it will be the faster node, the node that is less burdened by other queries, for example. And if any node in the channel dies, it really doesn't matter. You're not stuck in a TCP connection waiting for the handshake to happen, because the way it works is UDP: you send the message and you will get the response. And of course, if nobody responds in a reasonable amount of time, you can resend that message as well.
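To make the multi-field key behaviour concrete, here is a hedged ECL sketch (hypothetical fields and file names) of an index whose leading components are searched with KEYED while a middle component with no supplied value is declared WILD, which is the stepping-over behaviour described above. It assumes the key has already been built, for example in a Thor work unit.

    // Hypothetical address data and key, for illustration only.
    AddrRec := RECORD
      STRING2  state;
      STRING25 city;
      STRING5  zip;
      STRING30 name;
    END;

    addrs   := DATASET('~example::addresses', AddrRec, THOR);
    addrKey := INDEX(addrs, {state, city, zip}, {name}, '~key::addresses::geo');

    // Look up by state and zip while skipping city entirely.
    results := addrKey(KEYED(state = 'FL'), WILD(city), KEYED(zip = '33301'));
    OUTPUT(results);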
Tobias Macey
0:44:53
Going back to the architecture of the system, and the fact of how long it's been in development and use, and the massive changes that have occurred at the industry level as far as how we approach data management, the uses of data, and the overall systems that we might need to integrate with, I'm curious how the HPCC platform itself has evolved accordingly, and some of the integration points that are available for being able to reach out to or consume from some of these other systems that organizations might be using.
Flavio Villanustre
0:45:26
We have changed quite a bit. Even though the HPCC Systems name and some of the code base resemble what we had 20 years ago, as you can imagine, any piece of software is a living entity that changes and evolves, as long as the community behind it is active, right? So we have changed significantly: we have not just added core functionality to HPCC or changed functionality to adapt to the times, but also built integration points. I mentioned Spark, for example. Even though HPCC is very similar to Spark, Spark has a large community around machine learning, so it is useful to integrate with Spark, because many times people may be using Spark ML but want to use HPCC for data management, and having a proper integration where you can run Spark ML on top of HPCC is something that can be attractive to a significant part of the HPCC open source community. In other cases, like for example Hadoop and HDFS access, it's the same, and the same goes for integrations with other programming languages. Many times people don't feel comfortable programming everything in ECL. ECL works very well for something that is a data management centric process, but sometimes you have little components in the process that cannot be easily expressed in ECL, at least not in a way that is efficient.
I don't know, I'll just throw out one little example: generating unique IDs for things, where you want to generate those unique IDs in a random manner, like UUIDs.
Surely you can code this in ECL; you can come up with some crafty way of doing it in ECL. But it would make absolutely no sense to do it in ECL, to then be compiled into some big chunk of C++, when I can code it directly in C or C++ or Python or Java or JavaScript. So being able to embed all of these languages into ECL became quite important, and we built quite a bit of integration for embedded languages. A few major versions ago, a few years back, we added support for, I mentioned some of these languages already, Python, Java, and JavaScript, and of course C and C++ were already available before. So people can add these little snippets of functionality, create attributes that are just embedded language type of attributes, and those are exposed in ECL as if they were ECL primitives. So now they have the ability to expand the capabilities of the core language to support new things without the need to write them natively in ECL every time. There are plenty of other enhancements as well on the front end side. I mentioned ESP; ESP is this front end access layer, think of it as some sort of message box in front of your Roxie system. In the past, we used to require that you code your ECL query for Roxie and then write ESP source code in C++: you needed to go to ESP and extend it with a dynamic module to support the front end interface for that query, which is twice the work, and it requires someone who also knows C++, not just someone who knows ECL. So we changed that, and we now use something that is called Dynamic ESDL, which auto-generates, as I mentioned before, these interfaces from ESP. As you code the ECL, you code the query with some parameterized interface, and then automatically ESP will take those parameters and expose them in this front end interface for users to consume the query. We have also done quite a bit of integration with systems that can help with benchmarking of HPCC, availability monitoring, performance monitoring, and capacity planning of HPCC as well. We try to integrate as much as we can with other components in the open source community; we truly love open source projects, so if there is a project that has already done something that we can leverage, we try to stay away from reinventing the wheel and we just use it. If it's not open source, if it's commercial, we do have a number of integrations with commercial systems as well; we are not religious about it, but certainly it's a little bit less enticing to put the effort into something that is closed source. And again, we believe that the open source model is a very good model, because it gives you the ability to know how things are done under the hood and to extend them and fix them if you need to. We do this all the time with our projects, and we believe it has a significant amount of value for anyone out there.
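As a small illustration of the embedded-language feature mentioned here, the sketch below exposes a tiny Python snippet to ECL as if it were a native primitive. It assumes the Python3 embed plugin is installed on the cluster; the attribute name is otherwise made up.

    // Requires the HPCC Python3 embed plugin to be available on the cluster.
    IMPORT Python3;

    STRING36 MakeUuid() := EMBED(Python3)
        import uuid
        return str(uuid.uuid4())
    ENDEMBED;

    // Usable like any other ECL attribute, for example to tag a row with a random ID.
    OUTPUT(MakeUuid());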
Tobias Macey
0:50:26
On the subject of the open source nature of the project, I know that it was released as open source in, I think you said, the 2011 timeframe, which post-dates when Hadoop had become popular and started to accrue its own ecosystem. I'm curious what your thoughts are on the relative strength of the communities for Hadoop and HPCC currently, given that there seems to be a bit of a decline in Hadoop itself as far as the amount of utility that organizations are getting from it. I'm also interested in the governance strategy that you have for the HPCC platform and some of the ways that you approach sustainability of the project.
Flavio Villanustre
0:51:08
So you're absolutely right, that community has apparently, at least, reached a plateau. As for the HPCC Systems community, in number of people, of course, it was late to be opened up: we had HPCC as closed source, as proprietary, for a very long time, and at the time we believed it was so core to our competitive advantage that we couldn't afford to release it any other way. When we realized that, in reality, the core advantage that we have is on one side the data assets and on the other side the high level algorithms, we knew that the platform would be better sustained in the long run by opening it up. And sustainability is an important factor for us, because the platform is so core to everything we do, so we believed that making it open source and free, completely free, both as in free speech and as in free beer, would be the way to ensure this long term sustainability, development, expansion, and innovation in the platform itself. But when we did that it was 2011, so it was a few years after Hadoop. Hadoop, if you remember, started as part of another project around web crawling, which eventually ended up as a top level Apache project in 2008, I believe, so we were already three or four years behind, and their community was already really large. Over time we did gather a fairly active community, and today we have an active, deeply technical community that not only helps with extending and expanding HPCC, but also provides use cases, sometimes interesting use cases, of HPCC, and uses HPCC regularly. So while the HPCC Systems community continues to grow, the Hadoop community seems to have reached a plateau. Now, there are other communities out there which also handle some of the data management aspects with their own platforms, like Spark, which I mentioned before, which seems to have a better performance profile than what Hadoop has, so it has also been gathering active people into its community. Well, I think open source is not a zero sum game, where if one community grows another must shrink and the total number of people across all of them stays the same. I think every new platform that introduces capabilities to open source communities, brings new ideas, and helps apply innovation to those ideas is helping the overall community in general. So it's great to see communities like the Spark community growing, and I think there's an opportunity, since many users in both communities are using both at some point, for all of them to leverage what is there in the others. Surely, sometimes, the specific language used to code the platforms creates a little bit of a barrier. Some of these communities, just because Java is potentially more common, use Java instead of C++ and C, so you see that sometimes people in one community who may be more versed in Java feel uncomfortable going and trying to understand the code in the other platform that is coded in a different language.
0:54:52
But even then, at least
0:54:55
in general the ideas behind the functionality and capabilities can be extracted and adopted across platforms, and I think this is good for the overall benefit of everyone. I see open source, in many cases, as an experimentation playground where people can bring new ideas, apply those ideas to some code, and then everyone else eventually leverages them, because these ideas percolate across different projects. It's quite interesting. Having been involved personally in open source since the early 90s, I'm quite fond of the process and of how open source works. I think it's beneficial to everyone, in every community.
Tobias Macey
0:55:37
And in terms of the way that you're taking advantage of the HPCC platform at LexisNexis, and some of the ways that you have seen it used elsewhere, I'm wondering what are some of the notable capabilities that you're leveraging, and some of the interesting ways that you've seen other people take advantage of it?
Flavio Villanustre
0:55:54
That's a good question, and
0:55:56
the answer might take a little bit longer. At LexisNexis in particular, we certainly use HPCC for almost everything we do, because almost everything we do is data management or data quality in some way. We have interesting approaches to things: we have a number of processes that run on data, and one of those is this probabilistic linkage process. Probabilistic linkage sometimes requires quite a bit of code to make it work correctly. So there was a point where we were writing this in ECL, and it was creating a code base that was getting more and more sizable, larger, less manageable. So at
0:56:39
some point, we decided
0:56:41
that the level of abstraction, which is already pretty high in ECL, wasn't enough for probabilistic data linkage. So we created another language we called SALT, and that language is open source as well, by the way. You can consider it a domain-specific language for probabilistic data linkage and data integration. The compiler for SALT compiles SALT into ECL, the ECL compiler compiles ECL into C++, and the C++ compiler, Clang or GCC, compiles that into assembly. So you can see how the abstraction layers stack like the layers of an onion. Of course, every time we apply an improvement or optimization in the SALT compiler or the ECL compiler, or the GCC team applies an optimization, everyone on top of that layer benefits from it, which is quite interesting. We liked it so much that eventually we tackled another problem, which is dealing with graphs. And when I say graphs, I mean social graphs rather than
0:57:46
charts.
0:57:47
So we built yet another language that deals with graphs and machine learning, and particularly machine learning on graphs, which is called KEL, or Knowledge Engineering Language. We don't have an open source version of that one, by the way, but we do have a version of the compiler out there for people who want to try it. KEL also generates ECL, and the ECL compiles to C++, and again, back to the same point. So this is an interesting approach to building abstraction by creating DSLs, domain-specific languages, on top of ECL (a toy sketch of this layering idea follows this answer). Another interesting application of HPCC, outside of LexisNexis, is a company called Guardhat. They make hard hats that are smart: they can do geofencing for workers, and they can detect risky environments in manufacturing or construction settings. They use HPCC, along with some of the real-time integrations that we have with things like Kafka and CouchDB and the other integrations I mentioned, where we have activity around integrating HPCC with other open source projects, to essentially manage all of this data, which is fairly real time.
0:58:57
And they create
0:58:58
this real-time analytics, and then real-time machine learning execution for the models that they have, plus integration of data and even visualization on top of it. And there are more and more; I could go on for days giving you ideas of things that we, or others in the community, have done using HPCC.
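To make the compiler layering Flavio describes a bit more concrete, here is a toy sketch in Python. The mini-languages and function names are made up for illustration; this is not SALT, KEL, or ECL syntax. The point is only that two higher-level DSLs lower to one shared form, so a single optimization added at the lower layer benefits everything above it.

```python
# Toy illustration of stacked DSL layers (hypothetical, not SALT/KEL/ECL).

def compile_linkage_spec(fields):
    """'SALT-like' layer: a record-linkage spec lowered to match expressions."""
    return [f"MATCH(left.{f}, right.{f})" for f in fields]

def compile_graph_query(edge, hops):
    """'KEL-like' layer: a graph traversal lowered to the same expression form."""
    return [f"MATCH(step{i}.{edge}, step{i + 1}.{edge})" for i in range(hops)]

def lower_layer_optimize(exprs):
    """'ECL-compiler-like' layer: one optimization (here, simple deduplication)
    that every higher-level DSL inherits for free."""
    seen, out = set(), []
    for expr in exprs:
        if expr not in seen:
            seen.add(expr)
            out.append(expr)
    return out

print(lower_layer_optimize(compile_linkage_spec(["name", "dob", "name"])))
print(lower_layer_optimize(compile_graph_query("person_id", 2)))
```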
Tobias Macey
0:59:21
And in terms of the overall experience that you have had working with HPCC, both on the platform side and as a user of it, what have you found to be some of the most challenging aspects, and some of the most useful and interesting lessons you've learned in the process?
Flavio Villanustre
0:59:38
That is a great question. And
0:59:40
I'll give you a very simple answer, and then I'll explain what I mean. One of the biggest challenges, if you are a new user, is ECL. One of the biggest benefits is also ECL. Unfortunately, not everyone is well versed in declarative programming models. So when you are exposed for the first time to a declarative language that
1:00:04
has immutability and laziness and
1:00:09
no side effects, it can be a bit of a brain twister. You need to think about problems in a slightly different way to be able to solve them. With imperative programming, which is what most people are used to, you typically solve a problem by decomposing it into a recipe of things that the computer needs to do, step by step, one by one. With declarative programming, you decompose the problem into a set of functions that need to be applied, and you build it up from there. It's a slightly different approach, but once you get the idea of how it works, it becomes quite powerful, for a number of reasons. First of all, you get to understand the problem better, and you can express the algorithms in a far more succinct way, as just a collection of attributes, where some attributes depend on other attributes that you have defined. It also helps you better encapsulate the components of the problem, so your code, instead of becoming the sort of spaghetti that is hard to troubleshoot, is well encapsulated, both in terms of functions and in terms of data. If you need to touch anything later on, you can do it safely, without needing to worry about what some function you are calling might be doing behind your back, because you know there are no side effects. And, as long as you name your attributes correctly so that people understand what they are supposed to do, it also lets you collaborate more easily with other people. After a while I realized, and others have had the same experience, that the code I was writing in ECL was mostly correct most of the time, which is not what happens with non-declarative programming: you know that if the code compiles, there is a high chance that it will run correctly and give you correct results. As I explained before, with a dynamically typed, imperative language with side effects, the code may compile, and maybe it will run fine a few times, but one day it may give you some sort of runtime error because some type is mismatched, or because some side effect you didn't consider when you re-architected a piece of the code kicks in and makes your results different from what you expected. So again, ECL has been quite a blessing from that standpoint. But of course, it does require that you learn this new methodology of programming, similar to how someone who knows Python or Java needs to learn SQL, another declarative language, in order to query a database.
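As a rough illustration of the shift Flavio describes, here is a small sketch in Python rather than ECL (whose syntax is quite different), contrasting an imperative, step-by-step recipe with a declarative, side-effect-free set of named definitions. The data and names are made up for illustration.

```python
# A minimal sketch (in Python, not ECL) of imperative vs. declarative style.
records = [
    {"name": "Ann", "age": 34, "state": "GA"},
    {"name": "Bob", "age": 17, "state": "FL"},
    {"name": "Cara", "age": 52, "state": "GA"},
]

# Imperative style: a step-by-step recipe that mutates intermediate state.
adults_in_ga = []
for rec in records:
    if rec["age"] >= 18 and rec["state"] == "GA":
        adults_in_ga.append(rec)
adults_in_ga.sort(key=lambda r: r["name"])

# Declarative style: named definitions ("attributes") with no side effects,
# closer in spirit to how an ECL job is a collection of attribute definitions
# that the compiler is free to optimize and parallelize.
def is_adult(r):
    return r["age"] >= 18

def in_georgia(r):
    return r["state"] == "GA"

adults = [r for r in records if is_adult(r)]
georgia_adults = sorted((r for r in adults if in_georgia(r)),
                        key=lambda r: r["name"])

assert adults_in_ga == georgia_adults
```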
Tobias Macey
1:03:34
Looking forward, in terms of the medium to long term, as well as the near term, for the HPCC platform itself, what do you have planned for the future, both in terms of technical capabilities and features, and also as far as community growth and outreach that you'd like to see?
Flavio Villanustre
1:03:53
So on the technical capabilities and features side, we maintain a community roadmap and try, as much as we can, to stick to it. There are the big ideas that tend to get into the next or the following major version, the smaller ideas that are typically non-disruptive and don't break backwards compatibility, which go into the minor versions, and then, of course, the bug fixes.
1:04:23
As many say, they are not bugs but opportunities.
1:04:26
But on the
1:04:28
big ideas side of things, some of what we've been doing is better integration; as I mentioned before, integration with other open source projects is quite important. We've also been trying to change some of the underlying components in the platform. There are some components that we have had for a very, very long
1:04:46
time, like, for example, the
1:04:48
underlying communication layer in Roxie, and we think that may be ripe right now for further revamping, by incorporating some of the standard communication frameworks out there. There is also the idea of making the platform far more cloud friendly. It already runs very well in many public clouds, in OpenStack, Amazon, Google, and Azure, but we want to make the clusters more dynamic. I don't know if you spotted it when I explained how you do data management in Thor: you have a certain number of nodes, but what happens when you want to take a ten-node Thor and make it 20 or 30 nodes, or five nodes? Maybe you have a small process that would work fine with just a couple of nodes, or one node, and a large process that may need a thousand nodes. Today, you cannot dynamically resize a Thor cluster. Sure, you can resize it by hand and then redistribute the data so it is spread across the new number of nodes, but it is a lot more involved than we would like. In dynamic cloud environments that elasticity becomes quite important, because it's one of the benefits of cloud, so making the clusters more elastic, more dynamic, is another big goal.

Certainly, we continue to develop machine learning capabilities on top of the platform. We have a library of machine learning functions, algorithms, and methods, and we are expanding it. Some of these machine learning methods are quite innovative, I would say. One of our core developers, who is also a researcher, developed a new distributed algorithm for K-means clustering which she hadn't seen in the literature before. It's part of a paper and her PhD dissertation, which is very good, and it is also part of HPCC now, so people can leverage it. It gives significantly higher scalability to K-means, particularly if you're running on a very large number of nodes. I won't get into the details of how it achieves this far better performance, but in sum, it moves the data around less and instead distributes the centers more, and it uses the associative property of the main loop of K-means clustering to minimize the number of data records that need to be moved around (a rough sketch of this idea follows this answer). That's it from the standpoint of the roadmap and the platform itself.

On the community side, we continue to try to expand the community as much as we can. One of our core interests, and I mentioned this core developer who is also a researcher, is to get more researchers and academia onto the platform. We have a number of collaboration initiatives with universities in the US and abroad: Oxford University in the UK, University College London, Humboldt University in Germany, and a number of universities in the US such as Clemson University, Georgia Tech, and Georgia State University. We want to expand this program more, and we also have an internship program. One of the goals we want to achieve with the HPCC Systems open source project is to help better balance the community behind it, improving diversity across the community, attracting people broadly: gender diversity, regional diversity, and background diversity.
So we are also putting quite a bit of emphasis on students, even high school students. We are doing quite a bit of activity with high schools, on one side trying to get them more into technology, and of course to learn HPCC, but also trying to get more women into technology, and more people who otherwise wouldn't get into technology because they don't get exposed to it at home. That's another core piece of activity in the HPCC community. Last but not least, as part of this diversity effort, there are certain communities that are more disadvantaged than others, and one of those is people on the autism spectrum. We have been doing quite a bit of work with organizations that help these individuals, trying to enable them through a number of activities, some of which involve training them on the HPCC Systems platform and on data management, to open up opportunities for their lives. Many of these individuals are extremely intelligent, they're brilliant; they may have other limitations because of their conditions, but they can be very valuable contributors, not just at LexisNexis Risk Solutions, where ideally we could hire them, but for other organizations as well.
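To make the center-broadcast idea a little more concrete, here is a minimal sketch, in Python rather than ECL, of a single K-means iteration in which each partition keeps its records in place, receives the current centers, and returns only per-center partial sums, which are then combined associatively. The function names and data layout are made up for illustration and are not the actual HPCC implementation.

```python
import numpy as np

def local_partial_sums(partition, centers):
    """On one node: assign local points to the nearest center and return
    (sum_of_points, count) per center -- no raw records leave the node."""
    k, dim = centers.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    for point in partition:
        nearest = np.argmin(np.linalg.norm(centers - point, axis=1))
        sums[nearest] += point
        counts[nearest] += 1
    return sums, counts

def kmeans_step(partitions, centers):
    """Combine per-node partial sums (associative, so the combine can happen
    in any order or in a tree) and update the centers."""
    total_sums = np.zeros_like(centers)
    total_counts = np.zeros(centers.shape[0], dtype=int)
    for part in partitions:
        sums, counts = local_partial_sums(part, centers)
        total_sums += sums
        total_counts += counts
    # Keep the old center where a cluster received no points.
    return np.where(total_counts[:, None] > 0,
                    total_sums / np.maximum(total_counts[:, None], 1),
                    centers)

# Toy usage: three "nodes", two centers.
rng = np.random.default_rng(0)
partitions = [rng.normal(loc=c, size=(100, 2)) for c in (0.0, 5.0, 5.0)]
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
for _ in range(10):
    centers = kmeans_step(partitions, centers)
print(centers)
```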
Tobias Macey
1:09:48
It's great to hear that you have all these outreach efforts for trying to help bring more people into technology, as a means of giving back as well as a means of helping to grow your community and contribute to the overall use cases that it empowers. So, for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Flavio Villanustre
1:10:19
I think there are a number of gaps, but the major one is that many of the platforms out there tend to be quite clunky when it comes to integrating things. Unfortunately, I don't think we are mature enough yet. By mature enough I mean this: if you are a data management person, you know data very well, you know data analytics, you know data processes, but you don't necessarily know operating systems, and you are not necessarily a computer scientist who can deal with data partitioning and the computational complexity of algorithms on partitioned data. There are many details that today are necessary for you to do your job, but that should be unnecessary for you to do your job correctly. Unfortunately, because of the state of things today, many of these systems, commercial and non-commercial, force you to take care of all of those details, or to assemble a large team of people, from system administrators to network administrators to operating system specialists to middleware specialists, before you can build a system that lets you do your data management. That's something we do try to overcome with HPCC by giving you this homogeneous system that you deploy with a single command and that you can use a minute later, after you've deployed it. I won't say we are in the ideal situation yet; I think there is still much to improve on, but I think we are a little further along than many of the other options out there. If you know the Hadoop ecosystem, you know how many different components are out there, and if you have done this for a while, you know that one day you realize there is a security vulnerability in one component and you need to update it. But in order to do that, you're going to break the compatibility of the new version with something else, and now you need to update that other thing. But there is no update for yet another thing, because it depends on another component, and this goes on and on and on. So having something that is homogeneous, that doesn't require you to be a computer scientist to deploy and use, and that truly enables you at the abstraction layer that you need, which is data management, is missing from many, many systems out there. And I'm not just pointing this at the open source projects; it applies to commercial products as well. I think it's something that some of the people designing and developing these systems might not understand, because they are not the users. You need to put yourself in the shoes of the user in order to do the right thing; otherwise, whatever you build is pretty difficult to apply, and sometimes it's useless.
Tobias Macey
1:13:03
Well, thank you very much for taking the time today to join me and describe the ways that HPCC is built and architected, as well as some of the ways that it's being used both inside and outside of LexisNexis. I appreciate all of your time and all the information there. It's definitely a very interesting system, and one that looks to provide a lot of value and capability, so I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
Flavio Villanustre
1:13:30
Thank you very much. I really enjoyed this and I look forward to doing this again. So one day we'll get together again. Thank you

Straining Your Data Lake Through A Data Mesh - Episode 90

Summary

The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to grow your professional network and find opportunities with the startups that are changing the world then Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?
    • What are some of the organizational and industry trends that tend to lead to this solution?
  • You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?
    • In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?
  • What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?
  • One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?
    • A corollary to the issue of discovery is that of access and governance. What are some strategies to making that scalable and maintainable across different data products within an organization?
      • Who is responsible for implementing and enforcing compliance regimes?
  • One of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. What has been your experience in that regard?
    • How do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?
      • Has latency of data retrieval proven to be an issue in your work?
  • When it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?
    • How do team structures and dynamics shift in this scenario?
    • What are the necessary skills for each team?
  • Who is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?
  • Is there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?
  • For an organization that has an existing monolithic architecture, how do you suggest they approach decomposing their data into separately managed domains?
  • Are there any other architectural considerations that data professionals should be considering that aren’t yet widespread?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:13
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered with worldwide data centers, including new ones in Toronto and in Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, then AngelList is the place to go; go to dataengineeringpodcast.com/angel to sign up today. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management. So Zhamak, can you start by introducing yourself?
Zhamak Dehghani
0:02:05
Hi, Tobias. I am Zhamak Dehghani, and I am a technical director at ThoughtWorks, a consulting company; we do software development for many clients across different industries. I'm also a part of our Technology Advisory Board, which gives me, I guess, the privilege and the vantage point to see the projects and technology that we use globally across our clients, digest information from that, and publish it as part of our Technology Radar twice a year.
Tobias Macey
0:02:39
And do you remember how you got involved in the area of data management?
Zhamak Dehghani
0:02:42
Sure. So, earlier in my career, I built systems where part of what they had to do, as part of the kind of network management systems many years back, was to
Zhamak Dehghani
0:02:57
collect,
Zhamak Dehghani
0:02:58
real-time data from a set of devices across hundreds of nodes, and manage that data, digest it, and act on it; it was part of some sort of network management observability. However, after that experience, most of my focus had been on distributed architecture at scale, and mostly on operational systems. So data was something that was hidden inside operational services in a distributed system, there to support what those operational systems needed to do. Over the last few years, both being on the Technology Advisory Board for ThoughtWorks and also working with clients on the west coast of the US, where I'm located, I have had the opportunity of working with teams that were building data platforms. I support different technical teams that we have at different clients; sometimes I go really deep with one client, and sometimes I come across multiple projects. So I came at this from slightly left field, from distributed systems and operational systems, to working with people who were struggling to build the next generation data platform for a large retailer, or with teams building the next generation analytical platform for one of the tech giants here down in San Francisco and struggling to scale it. So over the last couple of years I have been working with teams that were heavily involved in recovering from past failures in data warehousing and data platforms, building the next generation, and seeing the challenges and experiences they were going through. I'm sorry, I can't name many of the clients I work with, but they often fall into the category of large organizations with fairly rich domains and rich data sets, where the state of the data is relatively unpredictable or of poor quality, because the organizations have been around for a long time and they've been trying to harness the data and use it for many years.
Tobias Macey
0:05:18
And so you recently ended up writing a post that details some of the challenges associated with data lakes and the monolithic nature of their architectural pattern, and proposing a new approach that you call a data mesh. So can you start by providing your definition of a data lake, since there is a lot of disagreement about that in the industry, and discussing some of the problems and challenges that they pose as an architectural and organizational pattern?
Zhamak Dehghani
0:05:47
Sure. So maybe I'll give you a little bit of a historical perspective on the definition, where it started, and what we see today as the general pattern. I think the term data lake was coined by James Dixon in 2010, and what he envisaged at the time was an alternative to the data warehousing approach. What he suggested was that a data lake is a place for a single data source to provide its data in its raw format, so that other people, whoever the consumers are, can decide how they want to use it and how they want to model it. The contrast was with the data warehouse, in that the data warehouse was producing pristine, modeled data that was well modeled, had very strong schemas, and was designed to address very specific use cases and needs around business intelligence and reporting and so on. The data lake, based on James Dixon's initial definition, was a place where a single data source provides unmodeled, raw data that other people can use. Later on, he actually corrected and enhanced that definition, saying that what he intended a data lake to be was one place for raw data from a single data source, and that an aggregation of raw data from multiple data sources into one place is what he called a water garden. But I think his thinking around raw data versus modeled data stirred people's imagination, and what the data lake turned out to be is a place that aggregates raw data from all of the data sources, all corners of the organization, into one place, mostly to allow analytical usage, data scientists diving in and finding out what they need to find out. And it has evolved even from there. If you look, for example, at different data lake solution providers and their websites, whether they are the famous cloud providers like Azure, Google, and so on, or other service providers, the current incarnation of a data lake is not only a place where raw data from all the sources in the organization lands, but also a single place where the pipelines to cleanse and transform and provide different views and different access on that data exist as part of that data lake. I use data lake somewhat metaphorically; by data lake I really mean this current incarnation of it as a single, centralized, and rather monolithic data platform that provides data for a variety of use cases, anything from business intelligence to analytics to machine learning, accumulating data from all the domains and all the sources in the organization. So based on that definition, we can, I guess, jump into the second part of your question, about some of the underlying characteristics that these sorts of centralized data platforms share.
Tobias Macey
0:09:18
Yeah. So I'm interested in understanding some of the organizational and technical challenges that they pose, particularly in the current incarnation, where everybody understands a data lake as being essentially a dumping ground for raw data that you don't necessarily know what to do with yet, or that you just want to be able to archive essentially forever, and also some of the organizational and industry trends that have led us to this current situation.
Zhamak Dehghani
0:09:46
Yeah, absolutely. So I wrote this article, and I can go into the challenges in a minute, but I want to preface that with the litmus test the article has given me to validate whether the challenges I had seen were widely applicable or not. The challenges I had observed, technical and organizational, I observed working with a handful of clients, and also saw secondhand from the projects that we were running globally. However, since I've written the article, I have received tens of calls from different people and different organizations sharing the challenges they have and how much this resonated with them. That was confirmation that it wasn't just a problem in the pocket I was exposed to; it's more widely spread. I think some of the underlying challenges are related. So one part is the symptoms that we see, and the symptoms, the problems that we see, are mostly around
0:11:07
how do I bootstrap building a data lake? It's a big part of your infrastructure architecture, a central piece of your architecture, yet it needs to facilitate collaboration across a very diverse set of stakeholders in your organization: the people building your operational systems, who are collecting the source data, generating it essentially as a byproduct of what they're doing, and who really have no incentive to provide that byproduct for other people to consume, because there is no direct incentive or feedback cycle back to where the sources are. So this central piece of architecture needs to collaborate with a diverse set of people who represent the source data, and also with a diverse set of teams and business units that represent the consumers of the data and all the possible use cases you can imagine for a company to become data driven, essentially to make intelligent decisions based on the data. So the symptoms we see are either that people are stuck bootstrapping such a monolith, creating all those points of integration and collaboration, or that people who have succeeded in creating some form of that data lake or data platform have failed to scale it: they fail to scale onboarding new data sources and dealing with the proliferation of data sources in the organization, or they fail to scale responding to the consumption models, the different access points and consumption models for that data. So you end up with a centralized, rather stale, incomplete data set that really doesn't serve a diverse set of use cases; it might serve a narrow set of use cases. And the fourth failure symptom that I see is: okay, I've got a data platform in place, but how do I change my organization to actually work differently, use data for intelligent decision making, and augment applications to become smarter using that data? I'll put that fourth failure mode aside, because there are a lot of cultural issues associated with it, and focus on the architecture, although I know you can't really separate architecture from organization because they mirror each other, and on the root causes: what are the fundamental characteristics that any centralized data solution, whether it's the warehouse or the lake or your next generation cloud based data platform, shares that lead to those symptoms? I think the first and foremost is the assumption that data needs to be in one place. When you put things in one place, you create one team around managing it, organizing it, taking it in, digesting it, and serving it, and that fundamentally centralized view and centralized implementation is, by default, a bottleneck for any sort of scaling of operations. It limits how much organizations can really scale the operational use and intake of the data once you have a centralized team and a centralized architecture in place. The other characteristic that I think leads to those problems, centralization being the first and foremost, is that it conflates different capabilities when you create one piece of central architecture.
Especially in the big data space, I feel there are two separate concerns that get conflated as one thing and have to be owned by one team. One is the infrastructure: the underlying data infrastructure that handles hosting, transforming, and providing access to big data at scale. The other concern is the domain: the domain data, in raw format or in an aggregated, modeled format, the domains that we actually try to put together in one place. Those concerns, the domains and the infrastructure, get conflated in one place, and that becomes another point of contention and lack of scale. That in turn leads to other challenges around siloed people and siloed skill sets, which have their own impact and lead to the unfulfilled promise of big data, by siloing people based on technical skills, a group of data engineers or ML engineers, just because they know a certain tool set for managing data, away from the domains where the data actually gets produced and where the meaning of that data is known, and separating them from the domains that actually consume that data and are more intimately familiar with how they want to use it. Separating them into a silo is another point of pressure for the system, and it has a human impact as well: the lack of satisfaction and the pressure these teams are under. They don't really understand the domains, yet they are expected to provide support while ingesting this data and making sense of it, and the interface between the operational systems and the data lake or big data platform is fragile, because those operational systems and the lifecycle of their data change based on the needs of the operational domains, not the needs of the team consuming the data. So the data team is continuously playing catch up with the changes happening to the data and with the frustration of the consumers, because the data scientists or ML engineers or business analysts who want to use the data are fully dependent on the siloed data platform engineers to provide the data they need in the way they need it. You end up with a set of frustrated consumers on one side and poor data engineers in the middle trying to work under this pressure. So there is, I think, a human aspect to this as well.
Tobias Macey
0:17:18
Yeah, it's definitely interesting seeing the parallels between the monolithic approach of the data lake and some of the software architectural patterns that have been evolving, with people trying to move away from the big ball of mud because it's harder to integrate and maintain. You have a similar paradigm in the data lake, where it's hard to integrate all the different data sources in the one system, but also with other systems that might want to use it downstream of the data lake, because it's hard to separate out the different models, or version the data sets separately, or treat them separately, since they're all located in this one area.
Zhamak Dehghani
0:17:56
I agree. I think
Zhamak Dehghani
0:17:58
I have a hypothesis as to why this is happening, why we are not taking the learnings from monolithic operational systems, their lack of scale and their human impact, and bringing those learnings into the data space. I definitely see, as you mentioned, parallels between the two, and that's where this came from; I came at it a bit from left field. To be honest, when I wrote this article and started first talking about it, I had no idea whether I was going to offend a large community and have sharp objects thrown at me, or whether it was going to resonate, and luckily it's been the latter, with more positive feedback. So I definitely think there are parallels between the two. My hypothesis for why data engineering, or the big data platform, has been stuck at least six or seven years behind what we've learned in distributed systems architecture and complex system design is that the evolutionary thinking, the evolutionary approach to improving data platforms, still comes from the same community and the same frame of thinking that the data warehouse was built upon. Even though we have made improvements, right, we have changed ETL, extract, transform, load, into ELT, extract, load, transform, essentially, with the data lake, we are still fundamentally using the same constructs. If you zoom into what the data warehouse stages were in that generation of data platform, and then look at even the Hadoop based or data lake based models of today, they still have similar fundamental constructs, pipelines for ingestion, cleansing, transformation, and serving, as the major first level architectural components of the big data platform. We have layered, functionally divided architectural components. Back in the day, ten years ago, before we decoupled the monoliths, the very first approach that enterprises took with operational systems, when they were thinking about how on earth they were going to break down this monolith into some sort of architectural quantum they could evolve independently, was: well, I'm going to layer this thing. I'm going to put a UI on top, a business process layer in the middle, and data underneath, and I'm going to bring all my DBAs together to own and manage this centralized database that centrally manages data from different domains, and I'm going to structure my organization in layers as well. And that really didn't work, because change doesn't happen in one layer. How often do you replace your database? The change happens orthogonally to these layers, across them, right? When you build a feature, you probably need to change your UI and your business logic and your data at the same time. That led to the realization that, you know what, in a modern digital world, business is moving fast and change is localized to business operations, so let's bring in domain driven thinking and build these microservices around the domain where the change is localized, so we can make that change independently. I feel like we are following the same footsteps in data. We've come from the data warehouse and the big data world, with one team to rule them all, and now we're scratching our heads to say, okay, how am I going to turn this into architectural pieces
so I can somehow scale it? Well, flip the layered model 90 degrees and tilt your head, and what you see is a data pipeline: I'm going to create a bunch of services to ingest, a bunch of services to transform, and a bunch of services to serve, and have these data marts and so on. But how does that scale? It doesn't really scale, because you still have points of handshake and friction between those layers whenever you want to meaningfully create new data sets, new access points, new features in your data, new data products, in a way. I hope we can create some cross-pollination between the thinking happening in operational systems and the data world, and that's what I'm hoping to do with the data mesh: to bring those paradigms together and create this new model, which sits at the intersection of those disciplines we applied in operational domains, applied to the world of data so we can scale it.
Tobias Macey
0:22:24
And another thing that's worth pointing out from what you were saying earlier is the fact that this centralized data lake architecture is also oriented around a centralized team of data engineers and data professionals. Part of that is because, within the past few years, it's been difficult to find people who have the requisite skill set, partly because we've been introducing a lot of new tool chains and terminologies, but also because we're trying to rebrand the role definitions, and so we're not necessarily as eager to hire people who do have some of the skill sets but may have held a different position title, whether it's a DBA or a systems administrator, and convert them into these new role types. And so I think that's another part of what you're advocating for with this data mesh: the realignment of how the team is organized in more of a distributed and domain oriented fashion, embedded with the people who are the initial progenitors of the data and who have the business knowledge that's necessary for structuring it in a way that makes it useful to the organization as a whole. So I guess, if you can, talk through what your ideas are in terms of how you would organizationally structure the data mesh approach, both in terms of the skill sets and the team members, and also from a technical perspective, and how that contributes to a better approach for evolutionary design of the data systems themselves and the data architecture.
Zhamak Dehghani
0:23:59
Sure. I think the point you made around skill sets, and around not having really created career paths, either for software generalists to gain knowledge of all the tooling required to perform operations on data and provide data as a first class asset, or for the people siloed into a particular role, like data engineers who have come to that path perhaps from a DBA or data warehousing background and haven't had the opportunity to really perform as a software engineer
Zhamak Dehghani
0:24:39
and really act
Zhamak Dehghani
0:24:40
with the hat of a software engineer on, that has resulted in a silo of skill sets, but also a lack of maturity. I see a lot of data engineers who are wonderful engineers, they know the tools they are using, but they haven't really adopted the best practices of software engineering, things like versioning and continuous delivery of the data. These concepts are not well established or well understood, because they basically evolved in the operational domain, not in the data domain. And I see many wonderful software engineers who haven't had the opportunity to learn Spark and Kafka and stream processing and so on, so that they can add those to their tool belt. I think for the future generation of software engineers, hopefully not so far away, in the next few years, any generalist, full stack software engineer will have the toolkits they need to know, and have expertise in, to manipulate data as part of their tool belt. I'm really hopeful for that. There is an interesting statistic from LinkedIn, if I remember correctly from 2016, and I doubt it has changed much since then: they had 65,000 people who had declared themselves data engineers on their site, and there were 60,000 open jobs looking for data engineers in the Bay Area alone. That shows the discrepancy between what the industry needs and how we are enabling our people, our engineers, to fulfill those roles. Anyway, sorry, that was my little rant about career paths and developing data engineers. So I personally, with my teams,
Zhamak Dehghani
0:26:30
welcome the data enthusiasts
Zhamak Dehghani
0:26:33
who want to become data engineers, and provide career paths and opportunities for them to evolve into that role, and I hope other companies would do the same. So, going back to your question around what the data mesh is, and what fundamental organizational and technical constructs we need to put in place, I'm going to talk about the fundamental principles, and then hopefully I can bring it together into one cohesive sentence that describes it.

The first fundamental principle behind the data mesh is that data is owned and organized through the domains. For example, if you are in, let's say, a health insurance domain, you have the claims and all the operational systems, and you probably have many of them, that generate raw data about the claims that the members have put through, and that raw data should become a first class concern in your architecture. So the domain data, the data constructed or generated around domain concepts such as claims, members, or clinical visits, those are the first class concerns in your architecture, which I call domain data products, in a way. That comes from domain driven, distributed architecture. What does that mean? It means that at the point of origin, the systems that generate the data are representing the facts of the business as we operate it, such as events around claims, or perhaps historical snapshots of the claims, or some current state of the claims, and the teams providing those systems, the teams most intimately familiar with that data, are responsible for providing that data in a self-serve, consumable way to the rest of the organization. So one of the principles is that the ownership of data is now distributed, and it is given to the people who are best suited to know and own that data. That ownership can happen in multiple places, right? You might have source operational systems that now own a new data set, or streams of data, whatever format is most suitable to represent that domain. And I have to clarify that the raw data those systems of origin generate is not their internal operational database, because internal operational databases and data sets are designed for a different purpose and intent, which is to make my system work really well; they are not designed for other people to get a view of what's happening in the business in that domain and to capture those facts and realities. So it's a different data set, whether it's a stream, very likely, or a time series, or whatever format is native: data sets that are owned by the systems of origin and the people who operate those systems. And then you have data sets that may be more aggregate views. For example, again taking health insurance, you might want predictive points of intervention, or critical points of contact, where you want the care provider to contact a member to make sure they are getting the health care they need at the right point in time, so that they don't get sick and make a lot of claims on the insurance at the end of the day.
So the domain that is responsible for making those contacts and making those smart, predictive decisions about when and where to contact a member might produce an aggregate view of the member, the historical record of all the visits the member has made and all the claims, as a joined, aggregate view of the data. That data might be useful not only for their own domain but for other domains too, so it becomes another domain driven data set that the team provides for the rest of the organization. That's the distributed, domain oriented aspect of it.

The second principle is that, for data to really be treated as an asset, for it to be consumed in a distributed fashion by multiple people and still be joined together, filtered, meaningfully aggregated, and used in a self-serve way, I think we need to bring product thinking to the world of data. That's why I call these things domain data products. What does product thinking mean in a technology space? It means I need to put all the technological characteristics and tooling in place so that
0:31:29
I can provide a delightful experience for people who want to access the data. You might have different types of consumers: they might be data scientists, or people who just want to analyze the data and run some queries to see what's in it, or people who want to convert it into some easier-to-understand form, a spreadsheet, say. So you have a diverse set of consumers for your data sets. For a data set to be considered a product, a data product, and to bring product thinking to it, you need to think about, okay, how am I going to provide discoverability? That's the first step: how can somebody find my data product? Then addressability, so they can use it programmatically, following some standard. I need to put enough security around it so that whoever is authorized to use it can use it, and they see the things they should see and not the things they shouldn't. And for it to be self-serve as a product, I need really good documentation, maybe example data sets that people can just play with to see what's in the data, and I need to provide the schema so they can see structurally what it looks like (a sketch of such a data product descriptor follows this answer). So there is all the tooling around self-describing data, supporting the understanding of what the data is, a set of practices you need to apply for data to become an asset that is served as a product.

And finally, the third discipline whose intersection with the others makes the data mesh work is platform thinking. At this point in the conversation, people usually tell me: hold on a minute, you've now asked all of my operational teams, as independent teams, to own their data and also to serve it in such a self-serve, easy-to-use way; that's a lot of repeatable manual work these teams have to do before they can actually provide the data, all the pipelines they need to build internally so that they can extract data from their databases, or put event sourcing in place so they can emit events, plus all the work that needs to go into documentation and discoverability. How can they do this? That's a lot of cost, right? That's where data infrastructure, or platform thinking, comes into play. I see a very specific role in a data mesh, in this mesh of domain oriented data sets, for what I call self-serve data infrastructure: all the tooling we can put in place on top of the raw data infrastructure, the raw data infrastructure being your storage and your
Zhamak Dehghani
0:34:21
backbone messaging or
Unknown
0:34:23
backbone, event logs, and so on. But on top of it, you need to have a set of tooling that these data product teams can essentially use easily, so that they can build those data products quickly. So we put a lot of capabilities when we build a database into that layer, which is the self serve infrastructure layer for supporting discover ability, supporting, you know, documentation, supporting secure access, and and, and so on, so that our teams and the success criteria for that infrastructure, tooling, data infrastructure tooling, is that decrease lead to, for a data product team or an operational team to get their data set on to this kind of mesh mesh platform. And, and that's, that's, I think, a big part of
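To make the self-serve infrastructure idea a little more concrete, here is a minimal sketch of the kind of descriptor a domain team might hand to such a platform layer. It is purely illustrative: the DataProduct class, its fields, and the register call are hypothetical names invented for this example, not part of any specific tool discussed in the episode.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team hands to the self-serve platform layer."""
    name: str                    # globally unique, discoverable name
    owner: str                   # domain team accountable for the product
    address: str                 # where consumers can programmatically read the data
    schema: Dict[str, str]       # field name -> type, so consumers know the shape
    documentation_url: str       # human-readable docs and sample data
    allowed_roles: List[str] = field(default_factory=list)  # access control policy


def register(catalog_url: str, product: DataProduct) -> None:
    """Illustrative registration call: the platform would take care of discoverability,
    secure access, and documentation hosting on the team's behalf."""
    print(f"registering {product.name} (owner: {product.owner}) with {catalog_url}")


if __name__ == "__main__":
    member_view = DataProduct(
        name="claims.member-aggregate-view",
        owner="claims-domain-team",
        address="s3://claims-domain/member-aggregate-view/",
        schema={"member_id": "string", "visit_count": "int", "last_claim_at": "timestamp"},
        documentation_url="https://example.internal/docs/member-aggregate-view",
        allowed_roles=["data-scientist", "analyst"],
    )
    register("https://example.internal/data-catalog", member_view)
```

The point of the sketch is that the domain team declares intent in one place, and the platform layer does the repeatable work of making the product discoverable, documented, and securely accessible.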
And I think at this point it's easy to imagine that, okay, if I now have domain-oriented data products, clearly I need people who can bring product thinking to those domains. So you have people who play the role of data product owner. They might be the same person as the tech lead, or the same person as the product owner of the operational system, but what they care about is the data as a product they are providing to the rest of the organization: the lifecycle of that data, its versioning, what features or elements they want to put into it, what the service level objectives are, and what quality metrics they support. For example, if the data is near real time it probably has some missing or duplicate events in it, but maybe that's an accepted service level agreement in terms of data quality. And they think about all the stakeholders for that data. Similarly, our software engineers who are building operational systems now also need skill sets around using Spark or Airflow, or whatever tooling they need to implement their data products. And similarly the infrastructure folks, who are often dealing with your compute systems, your storage, and your build pipelines and providing those as self-serve tooling, are now also thinking about big data storage and about data discovery. That's an evolution we saw when the API revolution happened: a whole lot of technology, like API gateways and API documentation, came into infrastructure and became part of the service infrastructure. I hope to see the same kind of change happen with infrastructure folks supporting a distributed data mesh.
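As an illustration of the service level objectives and quality metrics a data product owner might declare, here is a small, hedged sketch. The thresholds, field names, and check_quality helper are assumptions made up for this example rather than drawn from any particular platform.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Hypothetical service level objectives a data product owner might publish.
SLO = {
    "max_staleness": timedelta(minutes=15),   # near-real-time: at most 15 minutes old
    "max_duplicate_ratio": 0.05,              # some duplicate events are tolerated
    "required_fields": ["member_id", "event_type", "event_time"],
}


def check_quality(records: List[Dict], last_updated: datetime) -> Dict:
    """Evaluate a batch of records against the declared SLOs and report the result."""
    staleness = datetime.now(timezone.utc) - last_updated
    ids = [r.get("member_id") for r in records]
    duplicate_ratio = 1 - len(set(ids)) / len(ids) if ids else 0.0
    missing = [f for f in SLO["required_fields"] if any(f not in r for r in records)]
    return {
        "fresh": staleness <= SLO["max_staleness"],
        "duplicates_ok": duplicate_ratio <= SLO["max_duplicate_ratio"],
        "missing_fields": missing,
    }


if __name__ == "__main__":
    batch = [
        {"member_id": "m-1", "event_type": "visit", "event_time": "2019-06-01T10:00:00Z"},
        {"member_id": "m-2", "event_type": "claim", "event_time": "2019-06-01T10:05:00Z"},
    ]
    print(check_quality(batch, last_updated=datetime.now(timezone.utc) - timedelta(minutes=5)))
```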
Tobias Macey
0:37:07
Yeah, I think when I was reading through your posts, one of the things that came to mind is a parallel with what's been going on in technical operations, where you have an infrastructure operations team that's responsible for ensuring there is baseline capacity for providing access to compute, storage, and networking. Layered on top of that, in some organizations, is the idea of the site reliability engineer who acts as a consultant to the different product teams, making sure they're aware of and thinking about all of the operational aspects of the software they're trying to deliver. Drawing the parallel with the data mesh, you have a data platform team providing access to the underlying resources for storing, processing, and distributing data, and then a set of people who are either embedded within the different product teams or acting as consultants to them, providing education and information about how to leverage those services. That way the data they're generating within their systems is usable downstream by other teams and other organizations, either for direct integration or for building secondary data products from it, in a reusable and repeatable manner.
Zhamak Dehghani
0:38:26
Yes, the parallels are definitely there, and I think we can learn from those lessons. We went through the separation of Dev and Ops, and with DevOps we brought them together, and as a side effect we created a wonderful generation of engineers called SREs. So absolutely, the same parallels and the same learnings can be applied to the world of data. One thing I would say, though, going into this distributed model, with distributed ownership around data domains, with infrastructure folks supporting that, and with product thinking and platform thinking brought together: it's a hard journey, and it's counter-cultural to a large degree compared with what we have in organizations today. Today the data platform, the data lake, is a separate organization off to the side, not really embedded into the operational systems. What I hope to see is data becoming the fabric of the organization. Today we do not argue about having APIs serving capabilities to the rest of the organization from services owned by different people, but for some reason we argue that that's not a good model for sharing data, which I'm still puzzled by. Data right now is separated; people still think there's a separate team to deal with my data, and the operational-system folks are not really incentivized to own or even provide trustworthy data. I think we still have a long way to go for organizations to reshape themselves, and there is a lot of friction and challenge between the operational systems and the data that needs to be unlocked, and the consumption of that data in a meaningful way.
I am working with, and have also seen, pockets of this being implemented by people who are getting started. Where I start with a lot of our clients, it doesn't look like distributed ownership and distributed teams on day one. In fact, we start with maybe one physical team that brings people from those operational domains in as SMEs, and brings infrastructure folks into the team. Though it's one physical team, the underlying tech is separate: the repos for the code that generates data for the different data domains are separate repos, the pipelines are separate, and the infrastructure has its own backlog. So virtually, internally, you have separate teams, but on day one they're still working under one program of work. Hopefully, as we mature in building this in a distributed fashion, those internal virtual teams, which have been working on their own repos with a very explicit domain they look after, can, once we have enough infrastructure in place to support those domains becoming completely autonomous, be turned into long-standing teams that are responsible for the data of that domain. That's a target state. I do want to mention that there is a maturity curve and a journey we have to go through, and we will probably make some mistakes along the way, and learn and correct, to get to that point at scale. But I'm very hopeful, because I have seen this happen in the parallel you mentioned in the operational world, and I think we only have a few missing pieces for it to happen in the data world. I'm also hopeful based on the conversations I've had with companies since this started: I've heard a lot of horror stories and failure stories and the need for change, and I've been talking to many companies that have come forward and said, we are actually doing this, let's talk.
So they may not have publicly published or talked about their approach, but internally they are at the leading edge of changing the way things are done and challenging the status quo around data management.
Tobias Macey
0:42:39
Yeah, one of the issues with data lakes is that, at face value, the benefit they provide is that all of the data is co-located, so you reduce latencies when you're trying to join across multiple data sets. But you potentially lose some of the domain knowledge of the source teams or source systems that were generating the information in the first place. Now we're moving in the other direction, trying to bring the domain expertise back to the data and providing it in a product form. But as part of that we create other issues, in terms of discoverability of all the different data sets and consistency across schemas for being able to join across them. If you leave everybody to their own devices they might end up with different schema formats, and then you're back in the territory of master data management systems, spending a lot of energy and time trying to coerce all of these systems into a common view of the information that's core to your overall business domain. So when I was reading through the post, initially I was starting to think that we're trading off one set of problems for another. But by having that underlying stratum of the data platform team, where the data itself is managed somewhat centrally from the platform perspective but separately from the domain-expertise perspective, I think we're starting to reach a hybrid where we're optimizing for the right things without too many trade-offs. So I'm curious what your perspective is on those areas, things like data discovery and schema consistency: where do the responsibilities lie between the different teams, and organizationally, for ensuring that you don't end up optimizing too far in one direction or the other?
Zhamak Dehghani
0:44:35
Absolutely, I think you made a very good point and a very good observation. We definitely don't want to swing back into data silos and inconsistency; the fundamental principle behind making any distributed system work is interoperability. You mentioned a few things. One is data locality for performance, data gravity: where the operations and the computation happen, for example your pipelines accessing the data and running close to it, with the data co-located. Those underlying physical implementation lessons from data lake design we should absolutely bring into the data mesh. A distributed ownership model does not mean a distributed physical location of the data, because, as you mentioned, the data infrastructure team's job is providing location-agnostic, or fit-for-purpose, storage as a service to those data product teams. It's not every team willy-nilly deciding where and how to store their data; there is an infrastructure that is set up to incorporate all of those performance, scalability, and co-location concerns of data and computation, and to tell those data product teams, in a somewhat agnostic way, where the data should be located. Physically, whether I'm on a bunch of S3 buckets or on an instance of Kafka, or wherever it is, as a data product team I shouldn't care about that; I should go to my data infrastructure to get that service from them, and they are best positioned to tell me where my data should live based on my needs, the scale of my data, and its nature. So we can still bring all of those best practices around the location of the data into the infrastructure. Discoverability is another aspect of it. I think one of the most important properties of a data mesh is having some sort of global, single point of access, a single point of discovery, so that as someone who wants to build a new solution, who wants to make a smart decision and get some insight about my patients in my health insurance example,
0:47:01
I need to be able to go to one place, my data catalog, for lack of a better word, a data discovery system, where I can see all of the data products that are available for me to use: who owns them, the meta information about that data, what its quality is, when it was last updated and with what cadence, and where I can get sample data to play with. That discoverability function I see as a centralized, global function, with every data product registering itself with that discovery tool in a federated way. We have implementations of that today, from primitive ones that are just Confluence pages to more advanced implementations. So that is a globally accessible and centralized solution, I think. One other thing you mentioned: now you leave the decisions about how the data should be provided and what the schema should be to those teams. That hasn't really changed from the lake, because the data lake style of implementation, at least in its pure initial definition, also leaves that to the data sources. But as I mentioned, interoperability and standardization is a key element, otherwise we end up with a bunch of domain data that there is no way to join together. For example, there are some entities, like a member, that cross domains: the member in a health insurance claims system probably has its own internal IDs, and the healthcare provider system has its own internal ID. The member is an entity that crosses different domains, so if we don't have standardization around adapting those internal member IDs to some sort of global member ID that lets you join the data across different domains, these disparate domain data sets are just good by themselves; they don't allow me to do merging and filtering and searching. So that global governance and standardization is absolutely key. That governance sets standards around anything that allows interoperability: from the formats of certain fields and how we represent time and time zones, to policies for federated identity management, to what the single way of securely accessing data is, so we don't have five different implementations of the secure access model. All of those concerns are part of that global governance, and the implementation of it, as you mentioned, goes into your data infrastructure as shared tooling that people can easily use. The API revolution wouldn't have happened if we didn't have HTTP as a baseline standardization, and REST as a standardization. Those standards are what allowed us to decentralize our monolithic systems, and we had API gateways and API documentation as a centralized place to find which APIs I need to use. I think those same concerns and the best practices out of the lake should come into the data mesh and not get lost.
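To illustrate the kind of standardization described here around shared entities, the following is a minimal sketch, using invented data and mapping tables, of translating domain-internal member IDs to a global member ID so that two domains' data products can be joined.

```python
# Hypothetical example: each domain keeps its own internal member IDs, and a governed
# mapping translates them to a global member ID so that cross-domain joins are possible.

CLAIMS_TO_GLOBAL = {"clm-001": "member-42", "clm-002": "member-77"}
PROVIDER_TO_GLOBAL = {"prv-900": "member-42", "prv-901": "member-99"}

claims_domain = [
    {"claims_member_id": "clm-001", "claim_total": 1250.0},
    {"claims_member_id": "clm-002", "claim_total": 310.0},
]
provider_domain = [
    {"provider_member_id": "prv-900", "visit_count": 7},
    {"provider_member_id": "prv-901", "visit_count": 2},
]


def to_global(records, mapping, id_field):
    """Index a domain's records by the standardized global member ID."""
    return {mapping[r[id_field]]: r for r in records if r[id_field] in mapping}


claims_by_member = to_global(claims_domain, CLAIMS_TO_GLOBAL, "claims_member_id")
visits_by_member = to_global(provider_domain, PROVIDER_TO_GLOBAL, "provider_member_id")

# Join the two domains' data products on the shared global ID.
for member_id in claims_by_member.keys() & visits_by_member.keys():
    print(member_id,
          "total claims:", claims_by_member[member_id]["claim_total"],
          "visits:", visits_by_member[member_id]["visit_count"])
```

Where the global ID mapping lives and who maintains it is exactly the kind of decision the global governance function described above would own.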
Tobias Macey
0:50:14
And from an organizational perspective, particularly for any groups that already have an existing data lake implementation, talking through this I'm sure it sounds very appealing, but it can also sound fairly frightening because of the amount of work that's necessary. As you mentioned, the necessary step is to take a slow and iterative approach to going from point A to point B. But there are also scales of organization that might not have the capacity to even begin thinking about having separate domains for their data, because everybody is just one group. So I'm curious what your perspective is on how this plays out for smaller organizations, both in terms of the organizational patterns and in terms of the amount of overhead that is necessary for this approach, and whether there is a certain tipping point where it makes more sense to go from a data lake to a data mesh, or vice versa?
Zhamak Dehghani
0:51:08
That's a very good question. For smaller organizations: if they have a data lake, if they have a centralized team that is ingesting data from many different systems, and this team is separate from the operational teams but, because of the size, they can manage it, and they have closed the loop. By closed loop I mean it's not just that they consume data and put it in one place; they have also satisfied the requirements of the use cases that consume the data, and they have become a data-driven company. If you're there, if you have managed to build a tight loop between the operational systems providing the data and the intelligence services, the ML or BI services, sitting on top of the data and consuming it, and you have a fast cycle of iteration to update the data and also to update your models, and that's working well for you, then there's no need to change. A lot of startups and scale-ups are still using the monolithic Rails applications they built on day one, and that is still okay. But if you have pain, and the pain you're feeling is around ingestion of the data and keeping it up to date, or understanding the data you have, if you see fragility, if that's a word, the fragile points of connectivity where the source gets changed and your pipelines fall apart, or you are struggling to respond to the innovation agenda of your organization, we talk about build, test, and learn, which requires a lot of experimentation, and your team is struggling to change the data to support that experimentation, then I think that's an inflection point to start thinking about how to decouple the responsibility. A very simple starting point would be to decouple your infrastructure from the data on top of it. Your data engineers today are probably looking after both the infrastructure pieces, which are cross-domain and really don't care which data domain they are serving or transforming, and the data itself. Separate the infrastructure into its own team, and design success criteria for it that are aimed at enabling the people who want to provide data on top of that infrastructure. Separate that layer first. When you come to the data itself, find the domains that are going through change more often, the domains that are changing continuously. Often those are not your customer-facing applications or the operational systems at the point of origin; those may change, but the data they emit often doesn't change much, although maybe it does. Often it's the ones sitting in the middle, the aggregates. Either way, find the data domain that is changing most rapidly and separate that one first. Even if your data team is still the same team, put a logical boundary inside it: some people become the owners or custodians of that particular domain, and bring those people together with the actual systems that are emitting the data, whether it's a system of origin or a team aggregating data in the middle. Then start experimenting with building that data set as a separate data product, and see how that works.
That is where I would start in a smaller organization. For organizations that already have many, many data warehouses, multiple iterations of them, somewhat disconnected, and they have a problem, the lake maybe is not really working for them; again, if you see the failure symptoms that I mentioned at the beginning of this conversation, you can't scale, you can't bootstrap, you haven't become data driven, then start by finding the use cases. I always take a value-driven, use-case-driven approach: find a few use cases where you can really benefit from data that comes directly from the source, data that is timely and rich and changes more often. Find use cases for that data, not just the data itself, whether they are ML use cases or BI use cases, group them together, identify the real sources of the data, not the intermediary data warehouses, and start building out this data mesh one iteration, one use case, a few data sources at a time. That's exactly what I'm doing right now, because the organizations I work with mostly fall into that category, that kind of hairy data space that needs to change incrementally. And there is still a place for the lake, and there is still a place for the data warehouse; a lot of BI use cases do require a well-defined multidimensional schema. But those don't need to be the centerpiece of this architecture. They become a node on your mesh, mostly closer to the consumption use cases, because they satisfy a very specific set of use cases around business intelligence. So they become consumers of your mesh, not the central piece for providing the data.
Tobias Macey
0:56:32
And in your experience of working with different organizations, across different problem domains and business domains, I'm wondering if there are any other architectural patterns, anti-patterns, or design principles that you think data professionals should be taking a look at that aren't necessarily widespread within the community?
Zhamak Dehghani
0:56:51
Yes, I think there is space to bring a lot of the practices that we take somewhat for granted today, when building a modern digital platform or a modern web application, to data and to ML. Continuous delivery is an example of that: all of the code, and also the data itself, and also the models that you create, are under the control of your continuous delivery system. They are versioned, they have integrity tests around them, and deploying to production means releasing something so that it's accessible to the rest of the organization. So I would say the basic good engineering hygiene around continuous delivery is not a well understood or well established concept in either data or ML. A lot of our ML engineers and data scientists don't even know what continuous delivery for ML looks like, because from making a model in R on your laptop to having something in production there are months of work and translation, somebody else rewriting the code for you, and then the thing in production is using the stale data it was trained on. There is no concept of, oh, my data changed, now I need to kick off another build cycle because I need to retrain my machine learning models. So continuous delivery in both the ML world and the data world, along with data integrity tests, testing for data, is not a well understood practice; versioning data is not a well understood practice; and the same goes for schema versioning and backwards compatibility. Those would be some of the early technical concerns and capabilities that I would introduce in data engineering or ML engineering teams.
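As a concrete sketch of the data integrity tests and schema backwards-compatibility checks mentioned here, the snippet below shows checks that could run in a continuous delivery pipeline. The schemas, field names, and rules are hypothetical and only meant to illustrate the idea.

```python
# Minimal sketch of checks that could run in a continuous delivery pipeline for a
# data product: a schema backwards-compatibility check plus a simple integrity test.

OLD_SCHEMA = {"member_id": "string", "event_type": "string", "event_time": "timestamp"}
NEW_SCHEMA = {"member_id": "string", "event_type": "string",
              "event_time": "timestamp", "channel": "string"}  # additive change only


def backwards_compatible(old: dict, new: dict) -> bool:
    """A new schema is backwards compatible if every old field still exists with the
    same type; adding fields is allowed, removing or retyping them is not."""
    return all(field in new and new[field] == old[field] for field in old)


def integrity_violations(records: list) -> list:
    """Return a list of violations: null member IDs or records missing schema fields."""
    violations = []
    for i, record in enumerate(records):
        if not record.get("member_id"):
            violations.append(f"record {i}: missing member_id")
        absent = set(NEW_SCHEMA) - set(record)
        if absent:
            violations.append(f"record {i}: missing fields {sorted(absent)}")
    return violations


if __name__ == "__main__":
    assert backwards_compatible(OLD_SCHEMA, NEW_SCHEMA), "schema change would break consumers"
    sample = [{"member_id": "m-1", "event_type": "visit",
               "event_time": "2019-06-01T10:00:00Z", "channel": "web"}]
    print("integrity violations:", integrity_violations(sample) or "none")
```

In this style of pipeline, a change that removes or retypes a field would fail the build before the data product is published, mirroring how code changes are gated by tests.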
Tobias Macey
0:58:47
And are there any other aspects of your thinking around the data mesh, the challenges of data lakes, or anything else pertaining to the overall use of data within organizations that we didn't discuss yet that you'd like to cover before we close out the show?
Zhamak Dehghani
0:59:03
I think we pretty much covered everything. I would maybe over-emphasize a couple of points. Making data a first-class concern, an asset, structured around your domains, does not mean that you have to have well-modeled data. It could be your raw data, it could be the raw events generated at the point of origin, but with product thinking, self-serve access, some form of understood or measured quality, and good documentation around it, so that other people can use it and you treat it as a product. It doesn't necessarily mean doing a lot of modeling of the data. The other thing I would mention, and I guess we already talked about it, is governance and standardization. I would love to see more standardization, the same way we saw with the web and with APIs, applied to data. We have a lot of different open source tools and a lot of proprietary tools, but there isn't a standardization that allows me, for example, to run distributed SQL queries across a diverse set of data sets. The cloud providers are in a race to provide all sorts of wonderful data management capabilities on their platforms, and I hope that race leads to some form of standardization that allows distributed systems to work. And I think a lot of the technologies we see today, even around data discovery, are based on the assumption that data is operational data hidden in some database in a corner of the organization, not intended for sharing, and that we need to go find it and then extract it. I think that's a flawed predisposition. We need to think about tooling for data that is intentionally shared and intentionally diverse. There is a huge opportunity for tool makers out there, a big white space to build next-generation tools that are not designed around finding data hidden somewhere, but designed for data that is intentionally shared and treated as an asset: discoverable, accessible, measurable, queryable, and distributedly owned. So those are the few final points to over-emphasize.
Tobias Macey
1:01:50
All right. And for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. We've talked a bit about it already, but I'd also like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Zhamak Dehghani
1:02:08
Standardization. If I could wish for one thing, it would be a little bit of convergence and standardization that still allows a polyglot world. You can still have a polyglot world, but I want to see something like the convergence that happened around Kubernetes in the infrastructure and operational world, or the standardization that we had with HTTP and, you know, REST or gRPC, in the world of data, so that we can support a polyglot ecosystem. I'm looking for tools that are ecosystem players and support a distributed, polyglot data world, not data that can only be managed because we put it in one database or one data store, or that is owned and ruled by one particular tool just because it's used by one party. Open standardization around data is what I'm looking for. And there are some small movements, although unfortunately they're not coming from the data world. For example, the work in the CNCF thinking about events as one of the fundamental concepts, talking about CloudEvents as a standard way of describing events. But that's coming from left field; it's coming from an operational world trying to play in an ecosystem, not from the data world. I hope we can get more of that from the data world.
Tobias Macey
1:03:33
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing and your thoughts on how we can make more scalable and maintainable data systems in our organizations and as an industry. It's definitely a lot to think about, and it mirrors a lot of my thinking in terms of the operational characteristics, so it's great to get your thoughts on the matter. Thank you for that, and I hope you enjoy the rest of your day.
Zhamak Dehghani
1:04:00
Thank you, Tobias, and thank you for having me. I've been a big fan of your work and what you're doing, getting information about this siloed world of data management out to everyone and really making that information available, to me at least since I've become a data enthusiast. Thank you for having me, I'm happy to help.
Tobias Macey
1:04:18
Have a good day.

Maintaining Your Data Lake At Scale With Spark - Episode 85

Summary

Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Delta Lake is and the motivation for creating it?
  • What are some of the common antipatterns in data lake implementations and how does Delta Lake address them?
    • What are the benefits of a data lake over a data warehouse?
      • How has that equation changed in recent years with the availability of modern cloud data warehouses?
  • How is Delta lake implemented and how has the design evolved since you first began working on it?
    • What assumptions did you have going into the project and how have they been challenged as it has gained users?
  • One of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?
    • In your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)
  • Can you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)
    • Are there limits in terms of the volume of data that can be managed within a single transaction?
  • How does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?
    • The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?
  • What have been the most difficult/complex/challenging aspects of building Delta Lake?
  • How is the data versioning in Delta Lake implemented?
    • By keeping a copy of all iterations of a data set there is the opportunity for a great deal of additional cost. What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?
  • What are the reasons for standardizing on Parquet as the storage format?
    • What are some of the cases where that has led to greater complications?
  • In addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?
  • When is Delta Lake the wrong choice?
    • What problems did you consciously decide not to address?
  • What is in store for the future of Delta Lake?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building An Enterprise Data Fabric At CluedIn - Episode 74

Summary

Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grow. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a business focused on building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric

Interview

  • Introduction

  • How did you get involved in the area of data management?

  • Before we get started, can you share your definition of what a data fabric is?

  • Can you explain what CluedIn is and share the story of how it started?

    • Can you describe your ideal customer?
    • What are some of the primary ways that organizations are using CluedIn?
  • Can you give an overview of the system architecture that you have built and how it has evolved since you first began building it?

  • For a new customer of CluedIn, what is involved in the onboarding process?

  • What are some of the most challenging aspects of data integration?

    • What is your approach to managing the process of cleaning the data that you are ingesting?
      • How much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?
    • How do you preserve and expose data lineage/provenance to your customers?
  • How do you manage changes or breakage in the interfaces that you use for source or destination systems?

  • What are some of the signals that you monitor to ensure the continued healthy operation of your platform?

  • What are some of the most notable customer success stories that you have experienced?

    • Are there any notable failures that you have experienced, and if so, what were the lessons learned?
  • What are some cases where CluedIn is not the right choice?

  • What do you have planned for the future of CluedIn?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Upsolver is and how it got started?
    • What are your goals for the platform?
  • There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
    • What are the shortcomings of a data lake architecture?
  • How is Upsolver architected?
    • How has that architecture changed over time?
    • How do you manage schema validation for incoming data?
    • What would you do differently if you were to start over today?
  • What are the biggest challenges at each of the major stages of the data lake?
  • What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
  • When is Upsolver the wrong choice for an organization considering implementation of a data platform?
  • Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
  • What features or improvements do you have planned for the future of Upsolver?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Iceberg is and the motivation for creating it?
    • Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
  • How has the use of Iceberg simplified your work at Netflix?
  • How is the reference implementation architected and how has it evolved since you first began work on it?
    • What is involved in deploying it to a user’s environment?
  • For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
    • Is there a migration path for pre-existing tables into the Iceberg format?
  • How is schema evolution managed at the file level?
    • How do you handle files on disk that don’t contain all of the fields specified in a table definition?
  • One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
  • What are the unique challenges posed by using S3 as the basis for a data lake?
    • What are the benefits that outweigh the difficulties?
  • What have been some of the most challenging or contentious details of the specification to define?
    • What are some things that you have explicitly left out of the specification?
  • What are your long-term goals for the Iceberg specification?
    • Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

Summary

With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by establishing a definition of data mastering that we can work from?
    • How does the master data set get used within the overall analytical and processing systems of an organization?
  • What is the traditional workflow for creating a master data set?
    • What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
    • What are the steps that an organization can take to evolve toward an agile approach to data mastering?
  • At what scale of company or project does it make sense to start building a master data set?
  • What are the limitations of using ML/AI to merge data sets?
  • What are the limitations of a golden master data set in practice?
    • Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
    • Are there specific problem domains that are more likely to benefit from a master data set?
  • Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
  • What storage mechanisms are typically used for managing a master data set?
    • Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that goes beyond the rest of their data infrastructure?
    • How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
  • What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
    • What suggestions do you have to help prevent such a project from being derailed?
  • What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Pachyderm with Daniel Whitenack - Episode 1

Summary

Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Daniel Whitenack about Pachyderm, a modern container based system for building and analyzing a versioned data lake.

Interview with Daniel Whitenack

  • Introduction
  • How did you get started in the data engineering space?
  • What is pachyderm and what problem were you trying to solve when the project was started?
  • Where does the name come from?
  • What are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options?
  • Because of the fact that the analysis code and the data that it acts on are all versioned together it allows for tracking the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics?
  • What does Pachyderm use for the distribution and scaling mechanism of the file system?
  • Given that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data?
  • For a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow?
  • Given that the state of the data is calculated by applying the diffs in sequence what impact does that have on processing speed and what are some of the ways of mitigating that?
  • Another compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?
    • How did you implement this feature so that it would be maintainable and easy to implement for end users?
  • Given that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. What are some things that users should be aware of to help mitigate this?
  • The data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow?
  • I see that the docs mention using the map reduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill?
  • What are some of the most interesting deployments and uses of Pachyderm that you have seen?
  • What are some of the areas that you are looking for help from the community and are there any particular issues that the listeners can check out to get started with the project?

Keep in touch

Free Weekend Project

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA