Summary
Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
- What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
- Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
- Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
- For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
- How does HPCC Systems manage persistence and scalability?
- What are the integration points available for extending and enhancing the HPCC Systems platform?
- What is involved in deploying and managing a production installation of HPCC Systems?
- The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
- How does the Thor engine manage data transformation and manipulation?
- What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
- For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
- How are you using the HPCC Systems platform in your work at LexisNexis?
- Despite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
- How is the HPCC Systems project governed, and what is your approach to sustainability?
- What are some of the additional capabilities that are only available in the enterprise distribution?
- When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
- What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
- What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
- What do you have planned for the future of HPCC Systems?
Contact Info
- @fvillanustre on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- HPCC Systems
- LexisNexis Risk Solutions
- Risk Management
- Hadoop
- MapReduce
- Sybase
- Oracle DB
- AbInitio
- Data Lake
- SQL
- ECL
- DataFlow
- TensorFlow
- ECL IDE
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference with upcoming events, including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
[00:01:25] Unknown:
Your host is Tobias Macey, and today I'm interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis
[00:01:33] Unknown:
Risk Solutions. So, Flavio, can you start by introducing yourself? Of course, Tobias. My name is Flavio Villanustre. I'm vice president of technology and CISO for LexisNexis Risk Solutions. At LexisNexis Risk Solutions, we have a data platform called the HPCC Systems platform. We made that platform open source in 2011, and since then, as part of my role, I've also been involved with leading the open source community initiative, ensuring that the open source community truly leverages the platform and helps contribute to the platform, and acting as a liaison between the LexisNexis Risk Solutions organization and the rest of the
[00:02:18] Unknown:
larger open source community. And do you remember how you first got involved in the area of data management?
[00:02:23] Unknown:
It has been seamless, probably since the early nineties, going from databases to database analytics to data management to data integration. Keep in mind that, even within LexisNexis, we started the HPCC Systems platform back before the year 2000. So back then, we already had data management challenges with traditional platforms, and we started with this. I've been involved with HPCC since I joined the company in 2002, but I've been in data management for a very long time before that. And so for the HPCC
[00:03:00] Unknown:
system itself, can you talk through some of the problems that it was designed to solve and some of the issues that you were facing at LexisNexis that led to its original creation? Oh, absolutely.
[00:03:11] Unknown:
So, in LexisNexis Risk Solutions, we started with risk management as our core competency back in the mid nineties. And as we got into risk management, one of the core assets when you are trying to assess risk and predict outcomes is data. Even before people spoke about big data, we had a significant amount of data, mostly structured. Some unstructured data too, but the vast majority structured. And we used to use the traditional platforms out there, whatever we could get our hands on. And, again, this is all back in the day before Hadoop and before MapReduce was applied as a distributed paradigm for data management or anything like that. So, databases like Sybase, Oracle, Microsoft SQL Server, whatever it was, and data management platforms like Ab Initio, Informatica, whatever was available at the time. And the biggest problem we had was, well, it was twofold. One was scalability.
All of those solutions typically run in a single system, so there is a limit to how much bigger you can go vertically. And if you are also trying to consider cost affordability of the system, that limit is much lower as well. Right? There is a point where you go beyond what a commodity system is, and you start paying a premium price for whatever it is. So that was the first piece. One of the attempts at solving this problem was to split the data and use different systems. But splitting the data also creates challenges around data integration. If you're trying to link data, surely you can take the traditional approach, which is to segment your data into tables, put those tables in different databases, and then use some sort of foreign key to join the data. But that's all good and dandy as long as you have a foreign key that is unique and that is reliable.
And that's not the case with data that you acquire from the outside. If you generate the data, you can have that. If you bring the data from the outside, you might have a record that says this record is about John Smith, and you might have another record that says this record is about John Smith. But do you know for sure if these two records are about the same John Smith? That's a linking problem. And the only way that you can do linking effectively is to put all the data together. So now we have this particular issue where, in order to scale, we need to segment the data, but in order to be able to do what we need to do, we need to put the data in the same data lake, as the term came later. We used to call this the data land; eventually we changed the term in the late 2000s because data lake became more well known. So, at that point, the potential paths to overcome the challenge were, well, we either split all of the data as we were saying before, and then we come up with some sort of a meta system that will leverage all of these data stores.
And, potentially, when you're doing probabilistic linkage, you have problems that are of quadratic computational complexity, O(n^2), or worse. So that means that we would pay a significant price in performance, but potentially it can be done if you have enough time and your systems are big enough and you have enough bandwidth in between the systems. But the complexity that you're gaining from a programming standpoint is also quite significant, and sometimes you don't have enough time. Sometimes you get data updates that are maybe hourly or daily.
And doing this big linking process might take you weeks or months if you are doing it across different systems. So the complexity in programming this is also a pretty significant factor to consider. At that point, we thought that maybe a better approach was to create sort of an underlying platform to apply these types of solutions to problems, these algorithms, in a divide and conquer type of approach. We would have something that would partition the data automatically and distribute the data, in partitions, onto different commodity computers. And then we would add an abstraction layer on top of it that would create a programming interface that gave you the appearance that you are dealing with a single system, with a single data store. And whatever you coded for that data store would be automatically distributed to the underlying partitions.
Also, because the hardware was far slower than it is today, we thought that a good idea would be to move as much of the algorithm as we could to those partitions rather than executing it centrally. Instead of bringing all of the data to a single place to process it, which that single place might not have enough capacity to do, we would do as much as we could, for example a pre-filter operation or a distributed grouping operation or a distributed filtering operation, across each one of the partitions. And eventually, once you need to do the global aggregation, you can do it centrally, but now with a far smaller dataset that is already pre-filtered. And then the time came to define how to build the abstraction layer. The one thing that we knew about was SQL as a programming language, and we said, well, this must be something that we can tackle with SQL as a programming interface for our data analysts.
But they were quite used to a data flow model because of the type of tools that they were using before, things like Ab Initio, for example, where the data flows are these diagrams in which your nodes are the operations, the activities that you perform on the data, and the lines connecting the activities represent the data traversing them. So we thought that a better approach than SQL would be to create a language that gave you the ability to build these sorts of data flows in the system. And that's how ECL was born, which is the language that runs on top of HPCC.
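To make that concrete, here is a minimal sketch of what an ECL data flow can look like. The logical file name, record layout, and field names are hypothetical; the primitives (DATASET, filtering, SORT, OUTPUT) are standard ECL.

```ecl
// Record layout for a hypothetical phone book file
PhoneRec := RECORD
  STRING50 Name;
  STRING20 Phone;
  STRING2  State;
END;

// Declare the (already loaded) logical file as a dataset
PhoneBook := DATASET('~tutorial::phonebook', PhoneRec, THOR);

// Each definition is a node in the data flow graph
FloridaOnly := PhoneBook(State = 'FL');   // filter activity
ByName      := SORT(FloridaOnly, Name);   // sort activity

// Materializing the result wires the activities together
OUTPUT(ByName);
```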
[00:09:05] Unknown:
So it's interesting that you had all of these very forward looking ideas in terms of how to approach data management well in advance of when the overall industry started to encounter the same types of problems as far as the size and scope of the data that they were dealing with that led to the rise of the Hadoop ecosystem and the overall ideas around data lakes and MapReduce and some of the new data management paradigms that have come up. And I'm wondering what the overall landscape looked like in the early days of building the HPCC system that required you to implement this in house and some of the other systems or ideas that you drew on for inspiration for some of these approaches towards the data management and the overall systems architecture for HPCC?
[00:09:52] Unknown:
That is a great question. It's interesting, because in the early days, when we told people what we were doing, they would look puzzled and ask, well, why don't you use database X Y Z or data management system X Y Z? The reality is that none of those would be able to cope with the type of data processing we needed at the frequency we needed it. They wouldn't offer the flexibility for processes like the probabilistic record linkage that I explained before. And they certainly weren't going to offer a seamless transition between data management and data querying, which was also one of the important requirements that we had at the time.
It was quite difficult to explain to others why we were doing this and what we were gaining by doing it. Map and reduce operations, as functional programming operations, have been around for a very long time, since the Lisp days in the fifties. But the idea of using map and reduce as operations for distributed data management didn't get published until, I think it was, September or December 2004. And I remember reading the original paper from the Google researchers and thinking, well, now someone else has the same problem, and they are doing something about it. Even though at the time we already had HPCC and we already had ECL, so it was perhaps too late to go back and try to reimplement the data management aspects and the programming layer abstraction on HPCC. Just for those people in the audience that don't know much about ECL: again, this is all open source, Apache 2.0 licensed, and there are no strings attached, so please go there and look at it. But in summary, ECL is a declarative, data flow programming language, not unlike the declarative manner of what you can find in SQL or in functional programming languages like Haskell, Lisp, Clojure, and other programming languages out there. But it's data flow. From that standpoint, it's closer to something like TensorFlow, if you are familiar with TensorFlow as a deep learning programming paradigm and framework. You code data operations with primitives that are data primitives, like, for example, sort: you can say sort this dataset by this column, in this order, and then you can add more modifiers if you want. You can do a join across datasets, and again, the operation is named join. And you can do a rollup operation, and the operation is named rollup. All of these are high level operations.
You define them in your program. And in a declarative programming language, you create definitions rather than assign variables. For those that are not familiar with declarative programming, and I assume many in this audience are, declarative programming has, for the most part, the property of having immutable data structures, which doesn't mean that you cannot do valuable work or do all of the work the same way or better. But it gets rid of side effects and other pretty bad issues you have with more traditional mutable data structures. So you define attributes. You define things: I have a dataset that is a phone book, and I want to define an attribute that is this dataset filtered by a particular value. And then I might define another attribute that uses the filtered dataset to now group it in a particular way. So, at the end of the day, an ECL program is just a set of definitions that are compiled by the ECL compiler, which compiles ECL into C, or in reality C++, which then goes into the C++ compiler of the system, GCC or Clang or whatever you have, and generates assembly code. And that is the code that runs in the platform. But the fact that ECL is such a high level programming language and the fact that it's declarative means that the ECL compiler can make decisions that a more imperative type of programming language wouldn't allow the compiler to make. The compiler in a declarative programming language, and in functional programming languages it's the same case, knows the ultimate goal of the program, because the program is, in some ways, isomorphic to an equation. You could inline, from a functional standpoint, every one of your statements into a single massive statement, which you, of course, don't do, for clarity's sake.
But the compiler can now, for example, do things like apply non-strictness. If you made a definition that is never going to be used, there is no point for that definition to be even compiled in or executed at all. That saves performance. If you have a conditional fork in a place in your code, but that condition is always met or never met, then there is no need to compile the other branch. All of this gives you performance implications that can be far more significant when you're dealing with big data. One of the particular optimizations can be around data encapsulation.
It is far more efficient and a lot faster, if you are going to do several operations over a very large dataset, to combine all of those operations and do only one pass on the data with all the operations, if that's possible. And the ECL compiler does exactly that, and takes away a little bit of the flexibility of the programmer, perhaps, by being far more intelligent at the moment the code is compiled. Of course, parameters can tell the compiler I know better and force it to do something that might be otherwise unreasonable. But just as an example, you could say, well, I want to sort this dataset, and then I want to filter it and get only these few records.
And if you say that in that order, an imperative programming language would first sort, and sort even in the most optimal case is an n log n type of operation in computational complexity, and then filter and get only a few records out of it, when the optimal plan would be to filter first, get those few records, and then sort those records. And the ECL compiler does exactly that.
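To illustrate that point, the sketch below is written "sort first, then filter", but because ECL definitions are declarative and side-effect free the compiler is free to apply the filter before the sort. The dataset, layout, and field names are hypothetical.

```ecl
PhoneRec := RECORD
  STRING50 Name;
  STRING20 Phone;
  STRING2  State;
END;
PhoneBook := DATASET('~tutorial::phonebook', PhoneRec, THOR);

// Written as: sort the whole phone book, then keep only one state.
Sorted   := SORT(PhoneBook, Name);   // n log n if executed as written
Filtered := Sorted(State = 'FL');    // only a few records survive

// The compiler can push the filter below the sort, so only the
// matching records are ever sorted.
OUTPUT(Filtered);
```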
[00:16:03] Unknown:
The fact that the language that you're using for defining these manipulations ends up being compiled, and I know that it's implemented in C and C++, both the ECL language itself as well as the overall HPCC platform, is definitely a great opportunity for better performance characteristics, and I know that in the comparisons that you have available between HPCC and Hadoop that's one of the things that gets called out. As far as the overall workflow for somebody who is interacting with the system using that ECL language, I'm curious if the compilation step ends up being in any way, not a hindrance, but a delaying factor as far as being able to do some experimental iteration, or if there is the capability of doing some level of interactive analysis or interaction with the data for being able to determine what is the appropriate set of statements to be able to get the desired end result when you're building an ECL data flow? That is another great question. I can see that you are quite versed in programming. So,
[00:17:13] Unknown:
you're right, the fact that ECL is compiled means that, and just to explain again for the rest of the audience, we have an integrated development environment called the ECL IDE. And, of course, we support others like Eclipse and Visual Studio and all of the standard ones, but I'll just talk about the ECL IDE because it's what I mostly use. In that case, when you write code, you write the ECL code, and then you can certainly run a syntax check on the code and verify that the code is syntactically correct. But at some point, you want to run the code because you want to know if, semantically, it makes sense and it will give you the right results. Right? So certainly, the compilation process can take longer. Now, the compiler does know what hasn't been modified. Remember, again, ECL is a declarative programming language. So if you haven't touched a number of attributes, and again, data structures are immutable, attributes that don't change, since there are no side effects, should compile to exactly the same thing. The fact that when you define a function, that function has referential transparency means that if you call that function at any time, it will give you the same result based on the parameters, and just on the parameters, that you're passing. With that, the compiler can take some shortcuts. And if you are recompiling a bunch of ECL attributes but you haven't touched many of them, it will just use the precompiled code for those and only compile the ones that you have changed. So the compilation process when you are iteratively working on code tends to be fairly quick, maybe a few seconds. Of course, you depend on having an ECL compiler available. Traditionally, we used to have a centralized approach to the ECL compiler, where there would be one or a few of them running in the system.
We have moved to a more distributed model where, when you deploy the ECL IDE and the ECL tools on your workstation, there's a compiler that goes with them. So the ECL compilation process can happen on the workstation as well, and that gives you the ability to have it available at all times when you're trying to use it. That was one of the bottlenecks at some point before, when you were trying to do this quick, iterative programming approach to things and the compiler was being used by someone that was compiling a massive amount of ECL code for some completely new job, which might have taken minutes, and you were just sitting there, picking your nose, waiting for the compiler to finish that one compilation.
But the time to compile is an extremely important consideration, and we continuously improve the compiler to make it faster. We have learned a lot over the years, as you can imagine. By the way, some of the same core developers have been developing the ECL compiler since the very beginning; Gavin Halliday, for example, has been with us since the start. He was one of the core architects behind the initial design of the platform, and he's still the lead architect developing that ECL compiler, which means that a lot of the knowledge that has gone into the compiler and into optimizing it is still there, and it keeps getting better and better. Of course, now with the larger community working on the compiler, and more people involved and more documentation around it, others can pick up where he leaves off. But, hopefully, he will be around and doing this for a very long time. And making sure that the compiler is as just in time as it can be is very important. There is no interpreter for ECL at this point, and I think it would be quite difficult to make it completely interactive, to the point where you submit just a line of code and it does something, because of the way a declarative
[00:21:16] Unknown:
programming paradigm works. Right? And also because of the fact that you're working, most likely, with large volumes of data distributed across multiple nodes, being able to do REPL-driven development is not really very practical, and it doesn't really make a lot of sense. But the fact that there is this fast compilation step and the ability to have near real time interactivity, as far as seeing what the output of your program is, is good to see, particularly in the big data landscape where you would otherwise have to wait for a whole job to run and then see what the output was before you could take that feedback and wrap it into your next attempt. And that's why there have been so many different layers added on top of the Hadoop platform, in the form of Pig and Hive and various SQL interfaces, to be able to get a more real time and interactive and iterative development cycle built in. Yeah. And you're absolutely right there.
[00:22:18] Unknown:
Now, one thing that I haven't told the audience yet is what the platform looks like, and I think we are getting to the point where it's quite important to explain that there are two main components in the HPCC Systems platform. There is one component that does data integration; this is the massive data management engine, equivalent to your data lake management system, which is called Thor. Thor is meant to run one ECL work unit at a time, where a work unit can consist of a large number of operations, and many of them are done in parallel, of course. And there is another one, which is known as Roxie, which is the data delivery engine. There is also one which is a sort of a hybrid, called hThor. Now, Roxie and hThor are both designed to handle tens of thousands or more operations at the same time, simultaneously.
Thor is meant to do one work unit at a time. So when you are developing on Thor, even though your compilation process might be quick, and you might run on small datasets quickly because you can execute the work unit on those little datasets using, for example, hThor, if you are trying to do a large data transformation over large datasets in your Thor system, you still need to go into the queue for that Thor, and you will get your time whenever it's due for you. Right? Surely, we have priorities, so you can jump into a higher priority queue, and maybe you can be queued right after the current job but before any other future jobs. We also partition jobs into smaller units, and those smaller units can also be segmented; they are fairly independent from each other.
So we could even interleave some of your jobs in between a job that is running, by getting in between each one of those segments of the work unit. But still, the interactiveness there is a little bit less than optimal. It is the nature of the beast, because you want to have a large system be able to process throughout all the data in a relatively fast manner. If we were trying to truly multiprocess there, most likely many of the resources available would suffer, so you might end up paying a significant overhead across all of the processes that are running in parallel. Now, I did say that Thor runs only one work unit at a time, but that was a little bit of a lie. That was really a few years ago.
Today, you can define multiple queues in a Thor, and you can make it run three, four, ten work units, but certainly not thousands of them. So that's a big difference between that and Roxie. Can you run your work unit in Roxie? Yes, or in hThor, and that will run concurrently with anything else that is running, with almost no limit there. Thousands and thousands of them can run at the same time. But there are other considerations on when you run things on Roxie or hThor versus on Thor. So,
[00:25:27] Unknown:
it might not be what you really want. Taking that a bit further, can you talk through a standard workflow for somebody who has some data problem that they're trying to solve, and the overall life cycle of the information as it starts from the source system, gets loaded into the storage layer for the HPCC platform, they define an ECL job for it that then runs on Thor, and then being able to query it out the other end from Roxie, and just the overall systems that get interacted with at each stage of that data life cycle?
[00:26:02] Unknown:
Oh, I'd love to. Very well, let's set up something very simple as an example. You have a number of datasets that are coming from the outside, and you need to load those datasets into HPCC. So the first operation that happens is something that is known as spray. Spray is a simple process, and the name comes from the concept of spray painting the data across the cluster. Right? This runs on a Windows box or on a Linux box, and it will take the dataset. Let's say that your dataset is, just to give a number, a million records long. It can be in any format, CSV or any other, fixed length, delimited, or whatever.
So, it will look at your total dataset, and it will look at the size of the Thor cluster where the data will reside initially for processing. Let's say that you have a million records in your dataset and you have ten nodes in your Thor; let's just make round numbers and small numbers. So, it will partition the dataset into ten partitions, because you have ten nodes, and it will then just copy, transfer, each one of those partitions to the corresponding Thor node. This is done, if it can be parallelized in some way, because, for example, your data is fixed length, by automatically using pointers and parallelizing the copy. If the data is in, I don't know, an XML format or a delimited format where it's very hard to find the partition points, it will need to do a pass on the data, find the partition points, and eventually do the parallel copying to the Thor system.
So now you will end up with ten partitions of the data, with the data in no particular ordering other than the ordering that you had before. Right? The first 100,000 records will go to the first node, the second 100,000 records will go to the second node, and so on and so forth until you get to the end of the dataset. This gives each one of the nodes a similar number of records, which tends to be a good thing for most processes. Once the data is sprayed, or while the data is being sprayed, depending on the size of the data, or even before, you will most likely need to write a work unit to work on that data. And I'm trying to do this example as if it's the first time you see that data; otherwise, all of this is automated. Right? You don't need to do anything manually. All of this is scheduled and automated, and the work unit that you already have will run on the new dataset and append it or do whatever needs to be done. But let's imagine that it's completely new. So now you write your work unit. Let's say that your dataset is a phone book and you want to, first of all, eliminate duplicates and build some rolled up views on that phone book. And, eventually, you're going to allow the users to run some queries on a web interface to look up people in the phone book.
And, just for the sake of argument, let's say that you are also trying to join that phone book with your customer contact information. So, you will write the work unit that will have that join to merge those two, and you will have some deduplication and perhaps some sorting. And after you have that, you will want to build some keys; you don't need to, but you will want to. There is, again, a key build process, and all of this runs on Thor as part of your work unit. So, essentially, it's all ECL.
You write that work unit in ECL and submit it, the ECL is compiled and runs on your data, and hopefully the ECL is syntactically correct when you submit it and it runs and gives you the results that you were expecting on that data. ECL, I did mention this before, is a statically typed language as well, which means that it is a little bit harder to have errors that only appear at run time. Between the fact that it has no side effects and that it's statically typed, most type errors, type matching errors, and errors matching types to function operations are a lot less frequent there. It's not like Python, where everything might seem okay,
the run might be fine, but then one run at some point will give you some weird error because a variable that was supposed to have a piece of text has a number, or vice versa. So, you run the work unit, and the work unit will give you the result. As a result of this work unit, it will potentially give you some statistics on the data, some metrics, and it will give you a number of keys. Those keys will also be partitioned in Thor, so if you have a ten node Thor, the keys will be partitioned in ten pieces on those nodes, and you will be able to query those keys from there as well. So you can write a few attributes that can do the query there. But at some point, you will want to write those queries for Roxie to use, and you will want to put that data in Roxie, because you don't have one user querying the data. You will have a million users going to query that data, and perhaps tens of thousands of them will be querying simultaneously. So, for that process, you write another piece of ECL, another sort of work unit, but we call this a query. And you submit that to Roxie instead of Thor.
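As a rough sketch of what that Thor side might look like in ECL: the logical file names, layouts, and join condition below are hypothetical, but the deduplication, join, and key build steps mirror the ones described above.

```ecl
PhoneRec := RECORD
  STRING50 Name;
  STRING20 Phone;
  STRING2  State;
END;
ContactRec := RECORD
  STRING50 Name;
  STRING50 Email;
END;

PhoneBook := DATASET('~tutorial::phonebook', PhoneRec, THOR);
Contacts  := DATASET('~tutorial::contacts', ContactRec, THOR);

// Eliminate duplicate phone book entries
Deduped := DEDUP(SORT(PhoneBook, Name, Phone), Name, Phone);

// Join the phone book with customer contact information
CombinedRec := RECORD
  STRING50 Name;
  STRING20 Phone;
  STRING50 Email;
END;
Combined := JOIN(Deduped, Contacts,
                 LEFT.Name = RIGHT.Name,
                 TRANSFORM(CombinedRec,
                           SELF.Email := RIGHT.Email,
                           SELF := LEFT));

// Persist the merged data and build a key for Roxie to use later
OUTPUT(Combined, , '~tutorial::phonebook::combined', OVERWRITE);
NameKey := INDEX(Combined, {Name}, {Phone, Email},
                 '~tutorial::phonebook::byname');
BUILD(NameKey, OVERWRITE);
```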
There is a slightly different way to submit it to Roxie: you select Roxie and you submit it there. The difference between this query and the work unit that you had in Thor is that the query is parameterized. Similar to a parameterized stored procedure in your database, you will define some variables that are supposed to be coming from the front end, from the input from the user. And then you just use the values in those variables to run whatever filters or aggregations you need to do there, which will work in Roxie and will leverage the keys that you have from Thor. As I said before, the keys are not mandatory. Roxie can perfectly well work without keys, and it even has a way to work with in-memory distributed datasets as well. So even if you don't have a key, you don't pay a significant price in the lookups by doing a sequential lookup on the data and full table scans of your database. So you submit that to Roxie. When you submit that query to Roxie, Roxie will realize that the data is not in Roxie, it's in Thor. And this is also your choice, but most likely you will just tell Roxie to load that data from Thor. It will know where to load the data from because it knows where the keys are and what the names of those keys are, and it will automatically load those keys. It's also your choice whether to tell Roxie to start allowing users to query the front end interface while it's loading the data, or to wait until the data is loaded before it allows the queries to happen. The moment you submit the query to Roxie, Roxie will automatically expose it on the front end; there is a component called ESP, and that component, ESP, exposes a web service interface.
And this gives you a RESTful interface, a SOAP interface, JSON for the payload if you're going with the RESTful interface, even an ODBC interface if you want, so you can even have SQL on the front end. The moment you submit the query, the query automatically generates all of these web service interfaces. So if you want to go with a web browser on the front end, or if you have an application that can use, I don't know, a RESTful interface over HTTP or HTTPS, you can use that, and it will automatically have access to that Roxie query that you submitted. Of course, a single Roxie might have not one query, but a thousand different queries at the same time.
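A Roxie query for that key might look roughly like the following sketch. The STORED definition is the parameterized input that ESP exposes through the auto-generated web service interface; the names and layouts are hypothetical and carry over from the Thor sketch above.

```ecl
// Parameterized input: exposed by ESP as a web service parameter
STRING50 searchName := '' : STORED('Name');

CombinedRec := RECORD
  STRING50 Name;
  STRING20 Phone;
  STRING50 Email;
END;

// Reference the data and the key built by the Thor work unit
Combined := DATASET('~tutorial::phonebook::combined', CombinedRec, THOR);
NameKey  := INDEX(Combined, {Name}, {Phone, Email},
                  '~tutorial::phonebook::byname');

// Return the matching entries; Roxie serves this to many concurrent users
OUTPUT(NameKey(KEYED(Name = searchName)));
```

Once published to a Roxie target (through the ECL IDE or the ecl command line tool), ESP exposes this as a versioned web service that front end applications can call.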
All of those queries are listening on that interface, and it can have several versions of the queries as well. The queries are all exposed, versioned, from the front end, so you know what the users are accessing. And if you are deploying a new version of a query or modifying an existing query, you don't break your users if you don't want to; you give them the ability to migrate to the new version as they want. And that's it. That's pretty much the process. Now, as you add complexity to this, you need to have automation. All of this can be fully automated in ECL. You might want to have data updates.
And I told you data is immutable. So every time you think you're mutating data, updating a dataset, you're really creating a new dataset, which is good because it gives you full provenance; you can go back to every version of your data. Of course, at some point you need to delete data or you will run out of space, and that can also be automated. And if you have updates to your data, we have concepts like superfiles, where you can apply updates, which are essentially new overlays on the existing data, and the existing work unit can just work on that happily as if it was a single dataset. So a lot of these complexities that would otherwise be exposed to the developer are all abstracted out by the system. If the developers don't want to see the underlying complexity, they don't need to. If they do, they have the ability to do that.
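For the superfile mechanism mentioned above, a minimal sketch using the ECL Standard Library might look like this; the file names are hypothetical, and in practice the add and remove steps would be part of an automated update job.

```ecl
IMPORT STD;

// A superfile is a named collection of sub-files that reads as one dataset
SuperName := '~tutorial::phonebook::super';

SEQUENTIAL(
  STD.File.CreateSuperFile(SuperName),
  STD.File.StartSuperFileTransaction(),
  STD.File.AddSuperFile(SuperName, '~tutorial::phonebook::base'),
  STD.File.AddSuperFile(SuperName, '~tutorial::phonebook::update_001'),
  STD.File.FinishSuperFileTransaction()
);

// A later job can then read the superfile as if it were a single logical file:
// PhoneBook := DATASET(SuperName, PhoneRec, THOR);
```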
I mentioned before that ECL will optimize things. So if you tell it to do this join, but before doing the join, do a sort, it might know whether that sort is really needed, and if not, the sort won't be done. And if you know that your data is already sorted, you might say, well, let's not do the sort, or I want to do this join locally in each one of the partitions instead of a global join, and the same thing with the sort operation. And if you tell it to do that, because you know better than the system, ECL will follow your orders. If not, it will take the safe approach to the operation, even if it's a little bit more overhead, of course. A couple of things that I'm curious about out of this are
[00:35:52] Unknown:
the storage layer of the HPCC platform and the ability to take backups of the information, which I know is something that is nontrivial when dealing with large volumes. And also, on the Roxie side, I know that it maintains an index of the data, and I'm curious how that index is represented and maintained and kept up to date in the overall life cycle of the platform?
[00:36:24] Unknown:
Those are also very good questions. So, in the case of Thor, we need to go down into a little bit of the system architecture. In Thor, you have each one of these nodes that handles, primarily, its chunk of data, its partition of the data. But there is always a buddy node, some other node that has its own partition but also has a copy of the partition of some other node. If you have ten nodes in your cluster, node number one might have the first partition and might have a copy of the partition that node ten has. Node number two might have partition number two, but also a copy of the partition that node number one has. And so on and so forth; every node has one primary partition and one backup partition from another node.
Every time you run a work unit, and I said that the data is immutable, you are generating a new dataset every time that you materialize data on the system, either by forcing it to materialize or by letting the system materialize the data when it's necessary. The system tries to stream as much as it can, in a way more similar to Spark or TensorFlow, where the data can be streamed from activity to activity without being materialized, unlike MapReduce. And at some point it decides that it's time to materialize, because the next operation might require materialized data, or because you've been going for too long with data that, if something goes wrong with the system, would be blown away. Every time it materializes data, there is a lazy copy happening, where the newly materialized data is copied to the backup node. Surely, there could be a point where something goes very wrong and one of the nodes dies and the data on that disk is corrupted. But you know that you always have another node that has that copy, and the moment you replace the failed node, pull it out and put another one in, the system will automatically rebuild that missing partition, because it has complete redundancy of all of the data partitions, always on a different node. In the case of Thor, this tends to be sufficient. There is, of course, the ability to do backups, and you can back up all of these partitions, which are just files in the Linux file system, so you can even back them up using any Linux backup utility, or you can use HPCC to do the backups for you into any other system. You can have cold storage. One of the problems is what happens when your data center is compromised and now someone has modified or destroyed the data in the live system, so you may want to have some sort of offline backup. You can handle all of this in your normal system backup configuration, or you can do it with HPCC and have it offload the data as well.
But for Roxie, the redundancy is even more critical. In the case of Thor, when a node dies, it is sometimes less convenient to let the system work in a degraded way, because the system is typically as fast as the slowest node. If all nodes are doing the same amount of work, a process that takes an hour will take an hour. But if you happen to have one node that died, then there is now one node that is doing twice the work, because it has to deal with two partitions of data, its own and the backup of the other one, and the process might take two hours.
So it is more convenient to just stop the process when something like that happens, replace the node, let the system rebuild that node quickly, and continue the processing. That might take an hour and twenty minutes, or an hour and ten minutes, rather than the two hours that it would otherwise have taken. And besides, if the system continues to run and your storage died in one node because it's old, then there is a chance that other storage devices, when they get put under the same stress, will die the same way. You want to replace that one quickly and have a redundant copy as soon as you can, so you don't run the risk of losing two of the partitions. If you lose two partitions that are on different nodes that are not the backup of each other, that's fine. But if you lose the primary node and the backup node for the same partition, there is a chance that you might end up losing the entire partition, which is bad. Again, bad if you don't have a backup, and even restoring a backup sometimes takes time, so it's also inconvenient.
Now, in the Roxie case, you have far larger pressure to have the process continue, because your Roxie system is typically exposed to online production customers that might pay you a lot of money for you to be highly available. So Roxie allows you to define the amount of redundancy that you want, based on the number of copies that you want. You could say, well, I have a ten node Roxie, and I just need one copy of the data, which is the default, or I need three copies of the data. So maybe the partition in node one will have a copy in node two, node three, and node four, and so on and so forth. Of course, you need four times the space, but you have far higher resilience if something goes very wrong. And Roxie will continue to work even if a node is down, or two nodes are down, or as many nodes as you want are down, as long as the data is still fine. Worst case scenario, even if you lose some partition completely, Roxie might, if you want, continue to run, but it won't be able to answer any queries that are trying to leverage that particular partition that is gone, which is sometimes not a good situation. You asked about the format of the keys, and the format of the keys, of the indexes, in Roxie is interesting. Those keys, which are, again, typically the format of the data that you have in Roxie, for the most part have a primary key. These are all keys that are multi field, like in a normal, decent database out there.
So they have multiple fields, and typically those fields are ordered by cardinality, so the fields with the larger cardinality will be at the front to make it perform better. It has interesting abilities. For example, you can step over a field that you don't have a value for, that you have a wildcard for, and still use the remaining fields, which is not something that a database normally does; once you have a field that you don't have a value to apply, the rest of the fields on the right hand side are useless. So Roxie has other things that are quite interesting there. But the way the data is stored in those keys is by decomposing the keys into two components. There is a top level component that indicates which node has that partition, and there is a bottom level component which indicates where on the hard drive of that node the specific data element, or the specific block of data elements, is. By decomposing the keys into these two hierarchical levels, every node in Roxie can have the top level, which is very small, so every node knows where to go for the specific values. That means every node can be queried from the front end, so you now have good scalability on the front end; you can have a load balancer and load balance across all of the nodes, and still, on the back end, they can go and know which node to ask for the data. Now, when I said that the top level has the specific partition, I lied a little bit, because it's not the node number. Roxie uses multicast.
Nodes, when they have a partition of the data, subscribe to a multicast channel. What you have in that top level is the number of the multicast channel that will handle that partition. That allows us to make Roxie nodes more dynamic and also handle fault tolerance situations where nodes go down. It doesn't matter: if you send the message to a multicast channel, any node that is subscribed to that multicast channel will get the message. Which one responds? Well, it will be the faster node, the node that is less burdened by other queries, for example. And if any node in that channel dies, it really doesn't matter. You're not stuck in a TCP connection waiting for a handshake to happen because the node went away. It is UDP: you send the message, and you will get the response. And, of course, if nobody responds in a reasonable amount of time, you can resend that message as well.
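To illustrate the multi-field keys and the wildcard stepping described above, an ECL index filter can mark leading key fields as KEYED and skip over others with WILD; the field names and index layout here are hypothetical.

```ecl
PersonRec := RECORD
  STRING2  State;
  STRING30 City;
  STRING5  Zip;
  STRING50 Name;
  STRING20 Phone;
END;
People := DATASET('~tutorial::people', PersonRec, THOR);

// Multi-field key on State, City, Zip, with Name and Phone as payload
PersonKey := INDEX(People, {State, City, Zip}, {Name, Phone},
                   '~tutorial::people::bylocation');

// Use the first and third key fields, stepping over City with a wildcard,
// which a conventional composite index cannot normally take advantage of.
Matches := PersonKey(KEYED(State = 'FL'), WILD(City), KEYED(Zip = '33301'));
OUTPUT(Matches);
```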
[00:44:55] Unknown:
Going back to the architecture of the system, and the fact of how long it's been in development and in use, and the massive changes that have occurred at the industry level as far as how we approach data management and the uses of data and the overall systems that we might need to integrate with, I'm curious how the HPCC platform itself has evolved accordingly, and some of the integration points that are available for being able to reach out to or consume from some of these other systems that organizations might be using? We have changed quite a bit. So even though the HPCC Systems
[00:45:29] Unknown:
name, and some of the code base, resembles what we had 20 years ago, as you can imagine, any piece of software is a living entity, and it changes and evolves and adapts, of course, as long as the community is active behind it. Right? So we have changed significantly. We have not just added core functionality to HPCC or changed the functionality that it had to adapt to the times, but also built integration points. I mentioned Spark, for example. Even though HPCC is very similar to Spark, Spark has a large community around machine learning. So it is useful to integrate with Spark, because many times people might be using Spark ML but they might want to use HPCC for data management, and having a proper integration where you can run Spark ML on top of HPCC is something that can be attractive to a significant amount of the HPCC open source community. Other cases, like, for example, Hadoop and HDFS access, are the same way.
Then there are integrations with other programming languages. Many times people don't feel comfortable programming everything in ECL. ECL works very well for anything that is a data management centric process, but sometimes you have little components in that process that cannot be easily expressed in ECL, or at least not in a way that is efficient. I'll just throw one out: let's say that you need to generate unique IDs for things, and you want to generate these unique IDs in a random manner, like UUIDs. Surely, you could code this in ECL.
I could come up with some crafty way of doing it in ECL, but it would make absolutely no sense to code it in ECL to then be compiled into some big chunk of C++ when I can code it directly in C or C++ or Python or Java or JavaScript. So being able to embed all of these languages into ECL became quite important, and we built quite a bit of integration for embedded languages. That goes back a few major versions, a few years ago. We added support for, I mentioned some of these languages already, Python, Java, and JavaScript. And, of course, C and C++ were already available before.
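Taking that UUID example, an embedded-language attribute might look something like this minimal sketch, assuming the Python embed plugin is installed on the cluster; the function name is hypothetical.

```ecl
IMPORT Python;

// Expose a Python snippet as if it were a native ECL primitive
STRING36 MakeUuid() := EMBED(Python)
  import uuid
  return str(uuid.uuid4())
ENDEMBED;

// Callable anywhere ECL expects a value, e.g. when tagging records
OUTPUT(MakeUuid());
```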
So people can add these little snippets of functionality, create attributes that are just embedded-language attributes, and those are exposed in ECL as if they were new ECL primitives. So now they have this expandability of the core language to support new things without needing to write them natively in ECL every time. And there are plenty of other enhancements as well on the front end side. I mentioned ESP. ESP is this front end access layer; think of it as some sort of message bus in front of your Roxie system. In the past, we used to require that you code your ECL query for Roxie, and then, since ESP is also coded in C++, you needed to go to ESP and extend it with a dynamic module to support the front end interface for that query, which is twice the work, and you require someone that also knows C++, not just someone that knows ECL. So we changed that, and we now use something that is called dynamic ESDL, which auto generates, as I mentioned before, these interfaces from ESP.
As you code these queries in ECL, they'll expect that you will call the query through some parameterized interface, and then, automatically, ESP will take those parameters and expose them in this front end interface for users to consume directly. We have also done quite a bit of integration with systems that can help with benchmarking of HPCC, availability monitoring, performance monitoring, and capacity planning of HPCC as well. So we try to integrate as much as we can with other components in the open source community.
We truly love open source projects. So if there is a project that has already done something that we can leverage, we try to stay away from reinventing the wheel, and we use it. If it's not open source, if it's commercial, we do have a number of integrations with commercial systems as well. We are not religious about it, but certainly it's a little bit less enticing to put the effort into something that is closed source. And, again, we believe that the open source model is a very good model because it gives you the ability to know how things are done under the hood and to extend and fix them if you need to. We do this all the time with our projects. We believe that it has a significant amount of value for anyone out
[00:50:26] Unknown:
there. On the subject of the open source nature of the project, I know that it was released as open source in, I think you said, the 2011 time frame, which postdates when Hadoop had become popular and started to accrue its own ecosystem. I'm curious what your thoughts are on the relative strengths of the communities for Hadoop and HPCC currently, given that there seems to be a bit of a decline in Hadoop itself as far as the amount of utility that organizations are getting from it. But, also, I'm interested in the governance strategy that you have for the HPCC platform and some of the ways that you approach sustainability of the project.
[00:51:08] Unknown:
So you're absolutely right. The Hadoop community has apparently, at least, reached a plateau. It is far larger than the HPCC Systems community in number of people. Of course, it was out in the open first; we had HPCC for a very long time, but it was closed source, it was proprietary, and at the time we believed that it was so core to our competitive advantage that we couldn't afford to release it in any other way. Then we realized that, in reality, the core advantages that we have are, on one side, the data assets, and on the other side, the high level algorithms. We knew that the platform would be better sustained in the long run by opening it up. Sustainability is an important factor for the platform and for us, because the platform is so core to everything we do that we believe in making it open source and free, completely free, not just free as in freedom of speech but also free as in free beer.
We thought that would be the way to ensure the long term survivability, development, expansion, and innovation in the platform itself. But when we did that, it was 2011, so it was a few years after Hadoop. Hadoop, if you remember, started as part of another project around web crawling and web crawl management, and eventually ended up as its own top level Apache project in 2008, I believe. So it was already three or three and a half years after Hadoop was out there, and that community was already large. Still, over time we did gather a fairly active community. And today we have an active, deeply technical community that not just helps with extending and expanding HPCC, but also provides ideas and use cases, sometimes very interesting use cases of HPCC, and uses HPCC regularly.
So the HPCC Systems community continues to grow, while the Hadoop community seems to have reached a plateau. Now, there are other communities out there which also handle some of the data management aspects with their own platforms, like Spark, which I mentioned before, and which seems to have a better performance profile than Hadoop, so it has been gathering active people as well. I think open source is not a zero-sum game where, if one community grows, another one must shrink so that the total number of people stays the same across all of them. I think every new platform that introduces capabilities to open source communities, introduces new ideas, and helps apply innovation to those ideas is helping the overall community in general. So it's great to see communities like the Spark community growing, and I think there's an opportunity, since many users in both communities are using both at some point, for all of them to leverage what is done in the others. Surely, sometimes the specific language used in coding the platforms creates a little bit of a barrier.
Some of these communities, just because Java is potentially more common, use Java instead of C++ or C. So you see that sometimes people in one community who might be more versed in Java feel uncomfortable going and trying to understand the code of another platform that is coded in a different language. But even then, at least the general ideas in the algorithms, functions, and capabilities can be extracted and used in the other. And I think this is good for the overall benefit of everyone. In many cases, I see open source as an experimentation playground where people can go, bring new ideas, apply those ideas to some code, and then everyone else eventually leverages them because these ideas percolate across different projects.
It's quite interesting. Having been involved personally in open source since the early nineties, I'm quite fond of the process by which open source works. I think it's beneficial to everyone in every community.
[00:55:38] Unknown:
And in terms of the way that you're taking advantage of the HPCC platform at LexisNexis and some of the ways that you have seen it used elsewhere, I'm wondering what are some of the notable capabilities that you're leveraging and some of the interesting ways that you've seen other people take advantage of it? That's a good question, and
[00:55:56] Unknown:
the answer might take a little bit longer. At LexisNexis in particular, certainly, we use HPCC for almost everything we do, because almost everything we do is data management and data querying in some way. One of the interesting approaches we have is around a number of processes that are run on data, and one of those is this probabilistic linkage process. Probabilistic linkage sometimes requires quite a bit of code to make it work correctly. There was a point where we were building this in ECL, and it was creating a code base that was getting larger and less manageable.
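For readers unfamiliar with the term, here is a toy Python illustration of the general idea behind probabilistic record linkage. It is not SALT or ECL, and the field names, weights, and threshold are made up; it only shows the field-weighted scoring pattern that such systems build on.

```python
# Toy illustration of probabilistic record linkage (not SALT itself): each
# agreeing field contributes a weight, and the summed score is compared to a
# threshold to decide whether two records refer to the same entity.
# Field names, weights, and the threshold are invented for this example.
FIELD_WEIGHTS = {"ssn": 8.0, "last_name": 3.0, "first_name": 2.0, "dob": 4.0}
MATCH_THRESHOLD = 9.0

def link_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of the fields that agree between the two records."""
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if rec_a.get(field) and rec_a.get(field) == rec_b.get(field)
    )

a = {"ssn": "123-45-6789", "last_name": "Smith", "first_name": "Jon", "dob": "1980-01-01"}
b = {"ssn": "123-45-6789", "last_name": "Smith", "first_name": "John", "dob": "1980-01-01"}

score = link_score(a, b)
print(score, "-> match" if score >= MATCH_THRESHOLD else "-> no match")
```

Real linkage systems go much further (fuzzy field comparisons, frequency-based weights, blocking to avoid comparing every pair), which is part of why the hand-written ECL kept growing.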
So at some point, we decided that that level of abstraction, which is pretty high anyway in ECL, wasn't enough for probabilistic data linkage. So we created another language. We called it SALT, and we didn't release that language as open source, by the way; it's still proprietary. You can consider it a domain specific language for data linkage, probabilistic data linkage, and data integration. The SALT compiler compiles SALT into ECL, the ECL compiler compiles ECL into C++, and Clang or GCC compiles that C++ into assembly.
So you can see how the abstraction layers are like the layers of an onion. Of course, every time we apply an improvement or an optimization in the ECL compiler, or the GCC compiler team applies an optimization in GCC, everyone on top of that layer benefits from it, which is quite interesting. We liked this approach so much that, eventually, when we had another problem, which was dealing with graphs (and when I say graphs, I mean social graphs rather than charts), we built yet another language that deals with graphs and machine learning, particularly machine learning on graphs, which is called KEL, the Knowledge Engineering Language. We don't have an open source version of it, by the way, but we do have a free version of the compiler out there for people that want to try it. KEL also generates ECL, and ECL generates C++, so again, back to the same point. It's an interesting approach to building abstraction by creating DSLs, domain specific languages, on top of ECL. Another interesting application of HPCC outside of LexisNexis is a company called GuardHat. They make hard hats that are smart: they can do geofencing for workers, and they can detect risky situations in manufacturing environments or in construction.
So they use HPCC, and they use some of the real time integration that we have with things like Kafka and CouchDB, among others. I mentioned that we have worked actively on integrating HPCC with other open source projects, and they use that to manage all of this data, which is fairly real time, and to create real time alerts, real time machine learning execution for the models they have, data integration, and even visualization on top of it. And there are more and more examples. I could go on for days giving you ideas of things that we or others in the community have done using HPCC.
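As a rough sketch of what the consuming side of such a real time pipeline can look like, here is a minimal Python example that reads events from a Kafka topic and raises simple alerts. This is not GuardHat's or HPCC's actual integration; the broker address, topic name, and message format are assumptions, and it uses the kafka-python package.

```python
# Rough sketch of a real-time alerting consumer (hypothetical pipeline, not the
# actual GuardHat/HPCC integration). Assumes the kafka-python package and a
# made-up topic carrying JSON events like {"worker_id": "W12", "zone": "blast-area"}.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "hardhat-telemetry",                         # hypothetical topic name
    bootstrap_servers="broker.example.com:9092", # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

RESTRICTED_ZONES = {"blast-area", "crane-swing"}

for message in consumer:
    event = message.value
    if event.get("zone") in RESTRICTED_ZONES:
        print(f"ALERT: worker {event.get('worker_id')} entered {event['zone']}")
```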
[00:59:22] Unknown:
And in terms of the overall experience that you have had working with HPCC, both on the platform and as a user of it, what have you found to be some of the most challenging aspects
[00:59:34] Unknown:
and some of the most useful and interesting lessons that you've learned in the process? That is a great question, and I'll give you a very simple answer and then explain what I mean. One of the biggest challenges, if you're a new user, is ECL. One of the biggest benefits is also ECL. Unfortunately, not everyone is well versed in declarative programming models. So when you are exposed for the first time to a declarative programming language that has immutability, laziness, and no side effects, it can be a little bit of a brain twister. Right? You need to think about problems in a slightly different way to be able to solve them. When you are used to imperative programming, you typically solve a problem by decomposing it into a recipe of things that the processor needs to do step by step, one by one.
With declarative programming, you decompose the problem into a set of functions that need to be applied, and you build it from the ground up. It's a slightly different type of approach, but once you get the idea of how this works, it becomes quite powerful for a number of reasons. First of all, you get to understand the problem more, and you can express the algorithms in a far more succinct way. Algorithms end up being just a collection of attributes, and some of the attributes depend on other attributes that you have defined.
It also helps you better encapsulate the components of the problem. So now your code, instead of becoming some sort of spaghetti that is hard to troubleshoot, is well encapsulated, both in terms of functional encapsulation and data encapsulation. If you need to touch anything later on, you can touch it safely without needing to worry about what some function you're calling could be doing, or whether you need to go and look at that function too, because there are no side effects. And, of course, as long as you name your attributes correctly so people understand what they are supposed to do, it also lets you collaborate more easily with other people. So after a while, I realized something about the code I was building in ECL, and others have realized the same thing.
The code that you write in ECL is, first of all, mostly correct most of the time, which is not what you get with non-declarative programming. You know that if the code compiles, there is a high chance that it will run correctly and give you correct results. As I was explaining before, with a dynamically typed language and imperative programming with side effects, sure, the code might compile and maybe it will run fine a few times. But one day it might give you some sort of runtime error because some type is mismatched, or some side effect that you didn't consider when you rearchitected some piece of the code now kicks in and makes your results different from what you expected.
I think, again, ECL has been quite a blessing from that standpoint. But, of course, it does require that you want to learn, and do learn, this new methodology of programming, which is similar to what someone that knows Python or Java needs to learn in order to use SQL. SQL, again, is another declarative language; you don't write SQL imperatively when you are trying to query a database.
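To illustrate the contrast being described (this is plain Python, not ECL, and only meant to convey the flavor of the two styles), the same small computation can be written as a step-by-step recipe or as a composition of pure, named definitions with no side effects:

```python
# Not ECL: a Python illustration of the imperative vs. declarative contrast.
people = [("Alice", 34), ("Bob", 17), ("Carol", 52)]

# Imperative: a recipe of steps the processor follows, mutating state as it goes.
adults_imperative = []
for name, age in people:
    if age >= 18:
        adults_imperative.append(name.upper())

# Declarative-ish: named, side-effect-free definitions that say *what* the
# result is; evaluation order and intermediate state are left to the runtime,
# much as ECL attributes leave execution strategy to the compiler.
is_adult = lambda person: person[1] >= 18
display_name = lambda person: person[0].upper()
adults_declarative = [display_name(p) for p in filter(is_adult, people)]

assert adults_imperative == adults_declarative == ["ALICE", "CAROL"]
```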
[01:03:36] Unknown:
Looking forward, in terms of the medium to long term as well as some of the near term for the HPCC platform itself, what do you have planned for the future in terms of technical capabilities and features?
[01:03:54] Unknown:
So, on the technical capabilities and features side, we tend to have a community road map, and we try, as much as we can, to stick with it. We have the big ideas that tend to get into the next or the following major version, the smaller ideas that are typically nondisruptive and don't break backward compatibility, which go into the minor versions, and then, of course, the bug fixes. And, as many say, they are not bugs but opportunities. On the big ideas side of things, some of what we've been doing is better integration; as I mentioned before, integration with other open source projects is quite important. We've also been trying to change some of the underlying components in the platform. There are some components that we have had for a very, very long time, like, for example, the underlying communication layer in Roxie and Thor, that we think might be ripe right now for a revamping by incorporating some of the standard communication layers out there. There is also the idea of making the platform far more cloud friendly, even though it already runs very well in any public cloud: OpenStack, Amazon, Google, and Azure.
But we also want to make the clusters more dynamic. I don't know if you spotted it when I explained how you do data management with HPCC and I said, well, you have a 10-node Thor. What happens when you want to change a 10-node Thor and make it a 20-node Thor, or a 5-node Thor? Maybe you have a small process that could work fine with just a couple of nodes or one node, and you have a large process that might need a thousand nodes. Today, you can't dynamically resize the full cluster. True, you can resize it by hand and then redistribute the data so that it is spread across the number of nodes that you now have.
But it is a lot more involved than we would like it to be. With dynamic cloud environments, elasticity becomes quite important because that's one of the benefits of the cloud. So making the clusters more elastic, more dynamic, is another big goal. Certainly, we continue to develop machine learning capabilities on top of the platform. We have a library of machine learning functions, algorithms, and methods out there, and we are expanding that. Some of these machine learning methods are, I would say, quite innovative. One of our core developers, who is also a researcher, developed a new distributed algorithm for k-means clustering, which she hadn't seen in the literature before. It's part of a paper and her PhD dissertation, which is very good, and the algorithm is also part of HPCC now. So people can leverage this, which gives significantly higher scalability to k-means, particularly if you're running a very large number of nodes. I don't want to get into the details of how it achieves this far better performance.
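The general pattern being described is to leave the data partitions where they are and move only the centroids and small partial aggregates between nodes on each iteration. Here is a rough, single-process Python sketch of that pattern; it is not the actual HPCC algorithm, and the data and cluster layout are invented for illustration.

```python
# Rough single-process sketch of centroid-centric distributed k-means:
# data stays in its partitions ("nodes"); only centroids go out and only
# small partial sums come back each iteration. Not the HPCC implementation.
import random

def partial_sums(partition, centroids):
    """Each node locally computes the per-centroid sum and count of its points."""
    k = len(centroids)
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for x, y in partition:
        j = min(range(k), key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2)
        sums[j][0] += x
        sums[j][1] += y
        counts[j] += 1
    return sums, counts

def kmeans(partitions, centroids, iterations=10):
    for _ in range(iterations):
        # "Broadcast" centroids, gather partial sums, and combine them;
        # only these small summaries would cross node boundaries.
        results = [partial_sums(p, centroids) for p in partitions]
        k = len(centroids)
        totals = [[0.0, 0.0] for _ in range(k)]
        counts = [0] * k
        for sums, cnts in results:
            for i in range(k):
                totals[i][0] += sums[i][0]
                totals[i][1] += sums[i][1]
                counts[i] += cnts[i]
        centroids = [
            (totals[i][0] / counts[i], totals[i][1] / counts[i]) if counts[i] else centroids[i]
            for i in range(k)
        ]
    return centroids

random.seed(1)
partitions = [[(random.gauss(c, 1), random.gauss(c, 1)) for _ in range(50)] for c in (0, 5, 10)]
print(kmeans(partitions, centroids=[(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]))
```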
But, in sum, it distributes the data less and instead distributes the centroids more, and it uses the associative property of the main loop of k-means clustering to try to minimize the number of data records that need to be moved around. That's from the standpoint of the road map and the platform itself. On the community side, we continue to try to expand the community as much as we can. One of our core interests, and I mentioned this core developer who is also a researcher, is to get more researchers and academia onto the platform. We have a number of collaboration initiatives with universities in the US and abroad, universities like Oxford University in the UK, University College London, and Humboldt University in Germany, and a number of universities in the US, like Clemson University, Georgia Tech, and Georgia State University. And I could go on; we want to expand this program more.
We also have an internship program. We believe that one of the things, one of the core goals, that we want to achieve with the HPCC Systems open source project is to help better balance the community behind it, improving diversity across the community: gender diversity, racial diversity, and background diversity. So we are also putting quite a bit of emphasis on students, even high school students. We are doing quite a bit of activity with high schools, on one side trying to get them more into technology and, of course, to learn HPCC, and on the other side trying to get more women into technology and more people that otherwise wouldn't get into technology because they don't get exposed to it at home. So that's another core piece of activity in the HPCC community. And last but not least, as part of this diversity, there are certain communities that are a little bit more disadvantaged than others.
One of those is people on the autism spectrum. So we have been doing quite a bit of activity with organizations that are helping them, trying to enable them with a number of activities. Some of those have to do with training them on the HPCC Systems platform and on data management, to open up opportunities for their lives. Many of these individuals are extremely intelligent, they're brilliant. They might have other limitations because of their conditions, but they would be very valuable, not just for LexisNexis Risk Solutions.
[01:09:44] Unknown:
Ideally, we would hire them ourselves, but even for other organizations as well. Yeah, it's great to hear that you have all of these outreach opportunities for trying to help bring more people into technology, as a means of giving back as well as a means of helping to grow your community and contribute to the overall use cases that it empowers. So, for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there are a number of gaps, but the major one is that many of the platforms out there tend to be quite clunky when it comes to integrating things. Unfortunately, we are at the point
[01:10:30] Unknown:
where I don't think we are mature enough. And by mature enough, I mean: if you are a data management person, you know data very well. You know data analytics, you know data processing, but you don't necessarily know operating systems. You are not a computer scientist who can deal with data partitioning, and all of that should be unnecessary for you to do your job correctly. But, unfortunately, today, because of the state of things, many of these systems, commercial and noncommercial, force you to take care of all of those details, or to assemble a large team of people, from system and network administrators to operating system specialists to middleware specialists, before you can build a system that can do your data management.
That's something we do try to overcome with HPCC by creating this homogeneous system that you deploy with a single command and that you can use a minute after you've deployed it. I won't say that we are in the ideal situation yet; I think there is still much to improve on, but I think we are a little bit further along than many of the other options out there. If you know the Hadoop ecosystem, you know how many different components there are out there. And if you have done this for a while, you know that one day you realize there is, I don't know, a security vulnerability in one component, in Pig, say. Now you need to update that, but in order to update it you're going to break the compatibility of the new version with something else. And now you need to update that other thing, but there is no update for it because it depends on yet another component, and this goes on and on and on. So the lack of something homogeneous, that doesn't require you to be a computer scientist to deploy and use, and that truly enables you at the abstraction layer that you need, which is data management, is a significant limitation of many, many systems out there. And, again, I'm not just pointing this at open source projects; it applies to commercial projects as well. I think it's something that some of the people designing and developing these systems might not understand because they are not the users, but they should think as a user. You need to put yourself in the shoes of the user in order to be able to do the right thing. Otherwise, whatever you build is pretty difficult to apply, and sometimes it's even useless. Well, thank you very much for taking the time today to join me and describe the ways that HPCC
[01:13:10] Unknown:
is built and architected, as well as some of the ways that it's being used both inside and outside of LexisNexis. I appreciate all of your time and all of the information there; it's definitely a very interesting system and one that looks to provide a lot of value and capability. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day. Thank you very much. I really enjoyed this, and I look forward to doing it again. One day, we'll get together again.
[01:13:41] Unknown:
Thank you.
Introduction to Flavio Villanustre and LexisNexis Risk Solutions
Challenges in Data Management and Creation of HPCC
Early Development and Industry Landscape
ECL Language and Compilation Process
Data Workflow in HPCC Systems
Storage and Redundancy in HPCC
Evolution and Integration of HPCC
Open Source Community and Governance
Future Plans and Community Outreach