Data Pipelines

A High Performance Platform For The Full Big Data Lifecycle - Episode 94

Summary

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC project and his work at LexisNexis Risk Solutions

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
    • What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
  • Can you describe the high level architecture of the HPCC platform and some of the ways that the design has changed over the years that it has been maintained?
  • Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
  • For someone who is using HPCC, can you talk through a common workflow and the ways that the data traverses the various components?
    • How does HPCC manage persistence and scalability?
  • What are the integration points available for extending and enhancing the HPCC platform?
  • What is involved in deploying and managing a production installation of HPCC?
  • The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
  • How does the Thor engine manage data transformation and manipulation?
    • What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
  • For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
  • How are you using the HPCC platform in your work at LexisNexis?
  • Despite being older than the Hadoop platform it doesn’t seem that HPCC has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC and how it compares to that of Hadoop over the past decade?
  • How is the HPCC project governed, and what is your approach to sustainability?
    • What are some of the additional capabilities that are only available in the enterprise distribution?
  • When is the HPCC platform the wrong choice, and what are some systems that you might use instead?
  • What have been some of the most interesting/unexpected/novel ways that you have seen HPCC used?
  • What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC platform and community?
  • What do you have planned for the future of HPCC?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

0:00:14
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode — that's L-I-N-O-D-E — today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI conference, the Strata Data conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Your host is Tobias Macey, and today I'm interviewing Flavio Villanustre about the HPCC project and his work at LexisNexis Risk Solutions. So Flavio, can you start by introducing yourself?
0:01:36
Of course, Tobias. My name is Flavio Villanustre. I'm Vice President of Technology and CISO for LexisNexis Risk Solutions. At LexisNexis Risk Solutions we have a data platform called the HPCC Systems platform, which we made open source in 2011. Since then, as part of my role, I've also been leading the open source community initiative: ensuring that the open source community truly leverages the platform and helps contribute to it, and acting as a liaison between the LexisNexis Risk Solutions organization and the rest of the larger open source community.
0:02:19
And do you remember how you first got involved in the area of data management?
0:02:22
Well, it has been seamless — probably since the early 90s, going from databases to database analytics to data management to data integration. Keep in mind that even within LexisNexis, we started the HPCC Systems platform back before the year 2000. So back then we already had data management challenges with traditional platforms, and we started with this. I've been involved with HPCC since I joined the company in 2002, but I had been in data management for a very long time before that.
0:02:57
And so for the HPCC system itself, can you talk through some of the problems that it was designed to solve and some of the issues that you were facing at LexisNexis that led to its original creation?
0:03:10
Oh, absolutely. So at LexisNexis Risk Solutions, we started with risk management as, I'd say, our core competency back in the mid 90s. And as you go into risk management, one of the core assets when you are trying to assess risk and predict outcomes is data. Even before people spoke about big data, we had a significant amount of data — mostly structured, some semi-structured data too, but the vast majority structured. And we used to use the traditional platforms out there, whatever we could get our hands on. And again, this is old, back in the day before Hadoop, and before MapReduce was applied as a distributed paradigm for data management or anything like that. So databases — Sybase, Oracle, Microsoft SQL Server, whatever it was — and data management platforms like Ab Initio and Informatica, whatever was available at the time. And the biggest problem we had was twofold. One was scalability: all of those solutions typically run on a single system, so there is a limit to how much bigger you can go vertically, and certainly if you're also considering the cost and affordability of the system, that limit is much lower as well, right — there is a point where you go beyond what a commodity system is and you start paying a premium price for whatever it is. So that was the first piece. One of the attempts at solving this problem was to split the data and use different systems, but splitting the data creates its own challenges around data integration. If you're trying to link data, surely you can take the traditional approach, which is to segment your data into tables, put those tables in different databases, and then use a foreign key to join the data. But that's all good and dandy as long as you have a foreign key that is unique and reliable, and that's not the case with data that you acquire from the outside. If you generate the data yourself, you can have that; if you bring the data from the outside, you might have a record that says this record is about John Smith, and you might have another record that says this record is about Mr. John Smith — but do you know for sure that those two records are about the same John Smith? That's a linking problem, and the only way you can do linking effectively is to put all the data together. So now we have this particular issue where, in order to scale, we need to segment the data, and in order to do what we need to do, we need to put the data in the same data lake. We used to call it a data "land"; we eventually switched terms in the late 2000s as "data lake" became more well known. So at that point, the potential paths to overcome the challenge were: well, we either split all of the data as we were doing before, and then come up with some sort of meta-system that would leverage all of these discrete data stores — and potentially, when you're doing probabilistic linkage, you have problems whose computational complexity is n squared or worse, so that means we would pay a significant price in performance, but it potentially can be done if you have enough time, your systems are big enough, and you have enough bandwidth between the systems. But the complexity you're gaining from a programming standpoint is also quite significant. And
0:06:33
sometimes you don't
0:06:34
have enough time; sometimes you get data updates that are maybe hourly or daily, and doing this big linking process may take you weeks or months if you're doing it across different systems. And the complexity of programming this is also a pretty significant factor to consider. So at that point, we thought that maybe a better approach was to create an underlying platform to apply this type of solution to the problem, with algorithms in a divide and conquer type of approach. We would have something that would partition the data automatically and distribute those partitions onto different commodity computers. Then we would add an abstraction layer on top of it that would create a programming interface giving you the appearance that you are dealing with a single system and a single data store, and whatever you coded for that data store would be automatically distributed to the underlying partitions. We also thought — because the hardware was far slower than it is today — that a good idea would be to move as much of the algorithm as we could to those partitions rather than executing it centrally. So instead of bringing all of the data to a single place to process it, which that single place might not have enough capacity to do, we would do as much as we can — for example a distributed grouping operation or a filtering operation — across each one of the partitions, and eventually, once you need to do the global aggregation, you can do it centrally, but now with a far smaller dataset that is already pre-filtered. Then the time came to define how to build the abstraction layer. The one thing that we knew about was SQL as a programming language, and we said, well, this must be something that we can tackle with SQL as a programming interface for our data analysts. But the people working with us were quite used to a data flow model because of the type of tools they were using before — things like Ab Initio — where the data flows are these diagrams in which the nodes are the operations, the activities you perform on the data, and the lines connecting them represent the data traversing those activities. So we thought that a better approach than SQL would be to create a language that gave you the ability to build these sorts of data flows in the system. That's how ECL was born, which is the language that runs on HPCC.
0:09:05
So it's interesting that you had all of these very forward looking ideas in terms of how to approach data management, well in advance of when the overall industry started to encounter the same types of problems as far as the size and scope of the data that they were dealing with — which led to the rise of the Hadoop ecosystem, and the overall ideas around data lakes and MapReduce and some of the new data management paradigms that have come up. And I'm wondering what the overall landscape looked like in the early days of building the HPCC system that required you to implement this in house, and some of the other systems or ideas that you drew on for inspiration for some of these approaches toward data management and the overall systems architecture for HPCC?
0:09:52
That is a great question. It's interesting, because in the early days, when we told people what we were doing, they would look at us puzzled and often ask, well, why don't you use database X, Y, Z, or data management system X, Y, Z? And the reality is that none of those would be able to cope with the type of data processing we needed. They wouldn't offer the flexibility of process — like this probabilistic record linkage that I explained before that we do — and they certainly didn't offer a seamless transition between data management and data delivery, which was also one of the important requirements that we had at the time. It was quite difficult to explain to others why we were doing this and what we were gaining by doing it. Map and reduce operations, as functional programming operations, have been around for a very long time, since the Lisp days in the 50s. But the idea of using map and reduce as operations for data management didn't get published until, I think, December 2004, and I remember reading the original paper from the Google researchers thinking, well, now someone else has the same problem, and they got to do something about it. At the time we already had HPCC, and we already had ECL, so it was perhaps too late to go back and try to re-implement the data management aspects and the programming layer abstraction on HPCC. Just for those people in the audience that don't know much about ECL — again, this is all open source, Apache 2.0 licensed, there are no strings attached, so please go there and look at it — in summary, ECL is a declarative dataflow programming language. Declarative, not unlike what you can find in SQL or in functional programming languages — Haskell maybe, Lisp and Clojure and others. From the dataflow standpoint it is closer to something like TensorFlow, if you are familiar with TensorFlow as a deep learning programming paradigm and framework. You encode data operations that are primitives — for example sort: you can say sort this dataset by this column in this order, and then add more modifiers if you want; you can do a join across datasets, and again the operation is named JOIN; and you can do a rollup operation, and the operation is named ROLLUP. All of these are high level operations, and you define them in your program. In a declarative programming language, you create definitions rather than assign variables. For those that are not familiar with declarative programming — and surely many in this audience are — declarative programming has, for the most part, the property of having immutable data structures, which doesn't mean that you cannot do valuable work; you can do all of the work the same way or better. But it gets rid of side effects and other pretty bad issues that come with more traditional mutable data structures. So you define things: I have a dataset that is a phone book, and I want to define an attribute that is this dataset filtered by a particular value, and then I might define another attribute that uses the filtered dataset to group it in a particular way. So at the end of the day, any single program is just a set of definitions that are compiled by the compiler.
And this compiler compiles ECL into C++, which then goes into the C++ compiler of the system — Clang or whatever you have — and generates the code that actually runs on the platform. The fact that ECL is such a high level programming language, and the fact that it is declarative, means that the ECL compiler can make decisions that a more imperative type of programming language wouldn't allow a compiler to make. The compiler in a declarative programming language — and in functional languages this is also the case — knows the ultimate goal of the program, because the program is, in some ways, isomorphic to an equation; from a functional standpoint you could even inline every one of your statements into a single massive statement, which of course you wouldn't do from a practical standpoint. But the compiler can now, for example, apply non-strictness: if you made a definition that is never going to be used, there is no point in that definition being compiled in or executed at all, and that saves execution time. If you have a conditional fork somewhere in your code, but that condition is always met or never met, then there is no need to compile the other branch. All of this has performance implications that become far more significant when you're dealing with big data. One particular optimization is around combining operations: if you are going to apply similar operations to every record in a dataset, it is far more efficient and a lot faster to combine all of those operations and do only one pass over the data, if possible — and the ECL compiler does exactly that. It takes away, perhaps, a little bit of flexibility from the programmer by being far more intelligent at the moment it compiles. Of course, programmers can tell the compiler "I know better" and force it to do something that may otherwise be unreasonable. But just as an example: you could say, well, I want to sort this dataset, and then I want to filter it and get only these few records. If you write it in that order, an imperative programming language would first sort — and sorting, even in the most optimal case, is an n log n type of computational complexity — and then filter and get only a few records out of it, when the optimal approach would be to filter first, get those few records, and then sort only those. The ECL compiler does exactly that.
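To make the declarative style concrete, here is a minimal ECL sketch (the file path and field names are illustrative assumptions, not taken from the episode). Each line is a definition rather than an assignment, and because only the final result is specified, the compiler is free to reorder the filter and the sort as described above.

```
// Record layout and logical file name are assumptions for illustration.
PersonRec := RECORD
    STRING20 FirstName;
    STRING20 LastName;
    STRING2  State;
END;

phoneBook := DATASET('~example::phonebook', PersonRec, THOR);

// Definitions, not statements: nothing executes until an action uses them.
georgians       := phoneBook(State = 'GA');              // filtered view
sortedGeorgians := SORT(georgians, LastName, FirstName); // sorted view of the filter

// The action. Even if the code had asked for the sort first and the filter
// second, the compiler could still filter before sorting, since the
// definitions only describe the desired result.
OUTPUT(sortedGeorgians);
```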
0:16:01
The fact that the language that you're using for defining these manipulations ends up being compiled — and I know that it's implemented in C and C++, both the ECL compiler itself as well as the overall HPCC platform — is definitely a great opportunity for better performance characteristics, and I know that in the comparisons that you have available between HPCC and Hadoop, that's one of the things that gets called out. As far as the overall workflow for somebody who is interacting with the system using the ECL language, I'm curious if the compilation step ends up being in any way — not a hindrance, but a delaying factor — as far as being able to do some experimental iteration, or if there is the capability of doing some level of interactive analysis or interaction with the data, for being able to determine what is the appropriate set of statements to get the desired end result when you're building an ECL data flow?
0:17:05
Nice — another great question. I can see that you're quite versed
0:17:10
in programming. So you're right, the fact that ECL is compiled means — just again, for the rest of the audience — we have an integrated development environment, the ECL IDE, and of course we support others like Eclipse and Visual Studio and all of the standard ones, but I'll just talk about the ECL IDE because it's what I mostly use. In that case, when you write code, you write the ECL code and you can certainly run a syntax check, which verifies that the code is syntactically correct. But at some point you want to run the code, because you want to know if it semantically makes sense and will give you the right results, right? And running the code, we go through the compilation process; depending on how large your code base is, the compilation process can take longer. Now, the compiler does know what has been modified. Remember, again, ECL is a declarative programming language, so if you haven't touched a number of attributes — and again, data structures are immutable — the attributes you didn't change, since there are no side effects, should behave exactly the same. When you define a function, that function has referential transparency: if you call the function at any time, it will give you the same result based only on the parameters that you're passing. With that, the compiler can take some shortcuts: if you are recompiling a bunch of ECL attributes but you haven't changed many of them, it will just use the precompiled code for those and only compile the ones you have changed. So the compilation process, when you are iteratively working on code, tends to be fairly quick — maybe a few seconds — though of course you depend on having an ECL compiler available. Traditionally we had a centralized approach to the ECL compiler, where there would be one or a few of them running in the system. We have moved to a more distributed model where, when you deploy the ECL IDE and the ECL tools on your workstation, a compiler goes there too, so compilation can happen on the workstation as well, and that gives you the ability to have it available at all times when you're trying to use it. One of the bottlenecks before was that, when you were trying to take this quick iterative programming approach, the compiler might be busy with someone compiling a massive amount of ECL from some completely new job, which may have taken minutes, and you were sitting there twiddling your thumbs waiting for the compiler to finish that one compile. By the way, the time to compile is an extremely important consideration, and we continue to improve the compiler to make it faster; we have learned a lot over the years, as you can imagine. Some of the same core developers who developed the ECL compiler — Gavin Halliday, for example — have been with us since the very beginning; he was one of the core architects behind the initial design of the platform, and he's still the lead architect developing the ECL compiler, which means that a lot of the knowledge that has gone into the compilation process and its optimization keeps making it better and better. Of course, now with the larger community working on the compiler, and more people involved and more documentation around it, others can pick up where he leaves off.
But hopefully he will be around and doing this for a long time. Making sure that the compiler is as just-in-time as it can be is very important. There is, at this point, no interpreter for ECL, and I think it would be quite difficult to make it completely interactive, to the point where you submit just a line of code and it does something, because of the way a declarative programming paradigm works, right.
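As a small illustration of the referential transparency Flavio describes (the attribute name and body below are hypothetical, not from the episode), an ECL attribute like this depends only on its parameter, so as long as its definition is untouched the compiler can keep reusing the previously compiled code for it:

```
IMPORT Std;

// The result depends only on the input string: no side effects, no hidden
// state, so unchanged definitions do not need to be recompiled.
EXPORT CleanName(STRING name) := Std.Str.ToUpperCase(TRIM(name, LEFT, RIGHT));
```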
0:21:17
And also, because you're working, most likely, with large volumes of data distributed across multiple nodes, being able to do REPL driven development is not really very practical, or it doesn't really make a lot of sense. But the fact that there is this fast compilation step and the ability to have near real time interactivity, as far as seeing what the output of your program is, is good to see — particularly in the big data landscape, where I know that the overall MapReduce paradigm was plagued in the early years by the fact that it was such a time consuming process to submit a job and then see what the output was before you could take that feedback and wrap it into your next attempt. That's why there have been so many different layers added on top of the Hadoop platform, in the form of Pig and Hive and various SQL interfaces, to be able to get a more real time, interactive, and iterative development cycle built in.
0:22:14
Yeah, you're absolutely right there. Now, one thing that I haven't told the audience yet is what the platform looks like, and I think we are getting to the point where it's quite important to explain that. There are two main components in the HPCC Systems platform. There is one component that does data integration — this massive data management engine, equivalent to your data lake management system — which is called Thor. Thor is meant to run one ECL work unit at a time, and that work unit can consist of a large number of operations, many of them running in parallel, of course. And there is another one, known as Roxie, which is the data delivery engine, and there is one which is a sort of hybrid, called hThor. Roxie and hThor are both designed to run tens of thousands or more operations at the same time, simultaneously; Thor is meant to do one work unit at a time. So when you are developing on Thor, even though your compilation process might be quick, and you might run on small datasets quickly — because you can execute the work unit on those small datasets using, for example, hThor — if you are trying to do a large transformation of a large dataset on your Thor system, you still need to go into the queue on that Thor, and you will get your turn whenever it's due, right? Certainly we have priorities, so you can jump into a higher priority queue, and maybe you can be queued after just the current job but before any other future jobs. We also partition jobs into smaller units, and those smaller units are fairly independent from each other, so we can even interleave some of your jobs in between a job that is running, by slotting in between those segments of the work unit. But interactivity there is a little bit less than optimal; it is the nature of the beast, because you want a large system that can process all of the data with high throughput, and if we were trying to truly multi-process there, most likely many of the available resources would suffer, so you could end up paying significant overhead across all of the processes running in parallel. Now, I did say that Thor runs only one work unit at a time, but that was a little bit of a lie — that was true a few years ago. Today you can define multiple queues in a Thor, and you can run three or four work units, but certainly not thousands of them. That's a big difference between Thor and Roxie. Can you run your work unit on Roxie, or on hThor? Yes, and it will run concurrently with anything else that is running, with almost no limit — thousands and thousands of them can run at the same time. But there are other considerations on when you run things on Roxie or hThor versus on Thor, so it might not be what you really want.
0:25:29
Taking that a bit further, can you talk through a standard workflow for somebody who has some data problem that they're trying to solve, and the overall lifecycle of the information as it starts from the source system, gets loaded into the storage layer of the HPCC platform, they define an ECL job that then gets executed in Thor or hThor, and then being able to query it out the other end from Roxie — and just the overall systems that it interacts with through that data lifecycle?
0:26:01
I'd love to. Very well, let's set up something very simple as an example. You have a number of datasets that are coming from the outside, and you need to load those datasets into HPCC. So the first operation that happens is something known as spray. Spray is a simple process — the name comes from the concept of spray painting the data across the cluster, right? This runs on a Windows box or a Linux box, and it will take the dataset — let's say that your dataset is a million records long. It can be in any format: CSV, fixed length, delimited, whatever. It will look at your total dataset, and it will look at the size of the Thor cluster where the data will be placed initially for processing. Let's say that you have a million records in your dataset and you have ten nodes in your Thor — let's just use round, small numbers. So it will partition the dataset into ten partitions, because you have ten nodes, and it will then just transfer each one of those partitions to the corresponding Thor node. If this can be parallelized in some way — because, for example, your data is fixed length — it will automatically use pointers and parallelize it. If the data is in XML format, or in a delimited format where it's very hard to find the partition points, it will need to do a pass over the data, find the partition points, and eventually do the parallel copy to the Thor system. So now you will end up with ten partitions of the data, with the data in no particular order other than the order you had before, right? The first 100,000 records will go to the first node, the second 100,000 records will go to the second node, and so on and so forth until you get to the end of the dataset. This gives each one of the nodes a similar number of records, which tends to be a good thing for most processes. Once the data is sprayed, or
0:28:10
while the data is being sprayed, depending on the size of the data,
0:28:13
or even before, you will most likely need to write a work unit to work on the data. I'm describing this example as if it's the first time you've seen that data — otherwise, all of this is automated, right, so you don't need to do anything manually. All of this is scheduled and automated, and a work unit that you already had will run on the new dataset and append it or do whatever needs to be done. But let's imagine that this is completely new. So now you write your work unit. Let's say that your dataset is a phone book, and you want to first deduplicate it and build some rolled-up views on the phone book, and eventually you want to allow users to run some queries on a web interface to look up people in the phone book. And just for the sake of argument, let's say that you're also trying to join that phone book with your customer contact information. So you will write the work unit that has that join to merge those two, you will have some deduplication and perhaps some sorting, and after you have that you will need — well, you won't need to, but you will want — to build some keys. There is, again, a key build process; all of this runs on Thor and will be part of your work unit. So essentially the ECL writer, working with ECL, submits their work unit; the ECL will be compiled and will run on your data. Hopefully the ECL will be syntactically correct when you submit it, and it will run, giving you the results that you were expecting on the data. I mentioned this before, but ECL is a statically typed language as well, which means that it is a little bit harder to get errors that only appear at runtime. Between the fact that it has no side effects and that it is statically typed, most typing errors, type errors, data errors, and malformed function operation errors are a lot less frequent. It is not like Python, where the code may
0:30:17
seem okay, and the
0:30:20
run may be fine, but then at some point one run will give you an error because a variable that was supposed to have a piece of text has a number, or vice versa. So you run the work unit, and it gives you the results — the result of this work unit will potentially give you some statistics on the data, some metrics — and it gives you a number of keys. Those keys will also be partitioned in Thor: across the Thor nodes, the keys will be partitioned in pieces on those nodes, and you will be able to query those keys from Thor as well — you can write a few attributes that do the querying there. But at some point you will want to write those queries for Roxie to use, and you will want to put the data in Roxie, because you don't have one user querying the data — you will have a million users going to query that data, and perhaps 10,000 of them will be querying simultaneously. So for that process you write another piece of ECL, another sort of work unit, but we call this a query, and you submit it to Roxie instead of Thor — there is a slightly different way to submit it: you select Roxie and you submit it. The difference between this query and the work unit you have in Thor is that the query is parameterized. Similar to a parameterized stored procedure in your database, you define some variables that are supposed to be coming from the front end, from the input from the user, and then you use the values in those variables to run whatever filters or aggregations you need to do there. That will run in Roxie and will leverage the keys that you built in Thor. As I said before, the keys are not mandatory — Roxie can work perfectly well without keys; it even has a way to work with in-memory distributed datasets, so even if you don't have a key, you don't pay a significant price in the lookup by doing a sequential scan of the data. So you submit that query to Roxie. When you do, Roxie will realize that it doesn't have the data — the data is in Thor — and, this is also your choice, but most likely you will just tell Roxie to load the data from Thor. It will know where to load the data from, because it knows what the keys are and what the names of those keys are, and it will automatically load those keys. It is also your choice whether to tell Roxie to start allowing users to query the front-end interface while it's loading the data, or to wait until the data is loaded before it allows queries to happen. The moment you submit the query to Roxie, it will automatically be exposed on the front end. There is a component called ESP; that component exposes a web services interface, which gives you a RESTful interface, a SOAP interface, JSON for the payload if you're going through the RESTful interface, even an ODBC interface if you want — so you can even have SQL on the front end. The moment you submit the query, it automatically generates all of these web service interfaces.
So if you want to go with a web browser on the front end, or if you have an application that can use, I don't know, a RESTful interface over HTTP or HTTPS, you can use that, and it will automatically have access to that Roxie query that you submitted. Of course, a single Roxie might have not one query but a thousand different queries at the same time, all of them exposing an interface, and it can have several versions of the same queries as well. The queries are all exposed, versioned, from the front end, so you know what the users are accessing, and if you are deploying a new version of a query, or a modified or extended one, you don't break your users if you don't want to — you give them the ability to migrate to the new version as they want. And that's it, that's pretty much the process. Now, you will want to have automation, and all of this can be fully automated in ECL. You may want to have data updates, and I told you data is immutable, so every time you think you're mutating data, updating data, you're really creating a new dataset — which is good because it gives you full provenance; you can go back to every previous version. Of course, at some point you need to delete data or you will run out of space, and that can also be automated. And if you have updates on your
0:34:36
data, we have concepts like superfiles, where you can apply updates, which are essentially new overlays on the existing data, and the existing work unit can just work on that, happily, as if it were a single dataset. So a lot of these complexities that would otherwise be exposed to the user, to the developer, are all abstracted away by the system. Developers who don't want to see the underlying complexity don't need to; if they do, they have the ability to. I mentioned before that ECL will optimize things: if you tell it, "do this join, but before doing the join do this sort," it may know whether that sort is actually needed. And if you know that your data is already sorted, you might say, well, let's not do this sort, or: I want to do this join on each one of the partitions locally instead of a global join — and the same thing with sorting; there is a local sort operation in ECL. Of course, if you tell it to do that and you know better than the system, ECL will follow your orders. If not, it will take the safe approach to your operation, even if it carries a little more overhead.
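A condensed sketch of the workflow just described, in ECL (the logical file names, field layouts, and query parameter are illustrative assumptions, not taken from the episode). The first part is the Thor work unit run against the sprayed file; the second part is the kind of parameterized query that would be published to Roxie, where the STORED value becomes the input exposed through ESP's SOAP/REST interfaces.

```
// ---- Thor work unit: clean the sprayed phone book and build a key ----
PhoneRec := RECORD
    STRING20 FirstName;
    STRING20 LastName;
    STRING12 Phone;
END;

rawBook := DATASET('~example::phonebook::sprayed', PhoneRec, THOR);

// Global sort plus dedup to remove exact duplicates.
cleanBook := DEDUP(SORT(rawBook, LastName, FirstName, Phone),
                   LastName, FirstName, Phone);

// Payload index: keyed on the name fields, carrying the phone number.
BookIdx := INDEX(cleanBook, {LastName, FirstName}, {Phone},
                 '~example::phonebook::byname');

OUTPUT(cleanBook, , '~example::phonebook::clean', OVERWRITE);
BUILD(BookIdx, OVERWRITE);

// ---- Roxie query (published separately): parameterized lookup ----
STRING20 lastNameIn := '' : STORED('LastName');

// KEYED tells the engine to use the index's leading key field for the lookup.
matches := BookIdx(KEYED(LastName = lastNameIn));

OUTPUT(CHOOSEN(matches, 100));
```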
0:35:47
A couple of things that I'm curious about out of this are the storage layer of the HPCC platform and some of the ways that you manage redundancy and durability of the data. I also noticed when looking through the documentation that there is some support for being able to take backups of the information, which I know is something that is non-trivial when dealing with large volumes. And also, on the Roxie side, I know that it maintains an index of the data, and I'm curious how that index is represented, maintained, and kept up to date in the overall lifecycle of the platform.
0:36:24
Those are also very good questions. So in the case of Thor, there is this concept — we need to go down to a little bit of system architecture. In Thor, each one of the nodes primarily handles its own chunk of data, its own partition of the data. But there is always a buddy node — some other node that has its own partition, but also has a copy of the partition of another node. If you have ten nodes in your cluster, node number one might have the first partition and also a copy of the partition that node ten has; node number two might have partition number two, but also a copy of the partition that node number one has, and so on and so forth. Every node has one primary partition and one backup partition from another node. Every time you run a work unit — I said the data is immutable, but you are generating a new dataset every time you materialize data on the system, either by forcing it to materialize or by letting the system materialize the data when necessary; the system tries to stream as much as it can, similar to Spark or TensorFlow, where data can be streamed from activity to activity without being materialized, unlike MapReduce — at some point it decides that it's time to materialize, because the next operation might require materialized data, or because you've been going for too long with data that would be lost if something went wrong with the system. Every time it materializes data, there is a lazy copy that sends the newly materialized data to those backup nodes. So there could be a point where something goes very wrong, one of the nodes dies and the data on its disk is corrupted, but you know that you always have another node with a copy. The moment you replace the node that went kaput — essentially pull it out, put another one in — the system will automatically rebuild that missing partition, because it has complete redundancy of all of the data partitions across the different nodes. In the case of Thor, that tends to be sufficient. There is, of course, the ability to do backups, and you can back up all of these partitions, which are just files in the Linux file system, so you can even back them up using any Linux backup utility, or you can have HPCC back them up for you into any other system; you can have cold storage. One of the problems is: what happens if your data center is compromised and someone modifies or destroys the data in the live system? So you may want some sort of offline backup, and you can handle all of this in your normal system backup configuration, or you can do it with HPCC and keep it offloaded as well. But for Roxie, redundancy is even more critical. In the case of Thor, when a node dies, it is sometimes less convenient to let the system keep working in a degraded way, because the system is typically as fast as the slowest node: if all nodes are doing the same amount of work, a process that takes an hour will take an hour, but if one node dies, there is now one node doing twice the work — because it has to deal with two partitions of data, its own and the backup of the other one — and the process may take two hours. So it is more convenient to just stop the process when something like that happens, replace the node, let the system rebuild that node quickly, and continue the processing.
And that might take an hour and 20 minutes, or an hour and 10 minutes, rather than the two hours it would otherwise have taken. Besides, if the system continues to run and the storage system died on one node because it's old, there is a chance that the other storage systems, when put under the same stress, will die the same way — you want to replace that one quickly and have a copy as soon as you can, and not run the risk that you lose two of the partitions. If you lose two partitions that are on different nodes and are not backups of each other, that's fine; but if you lose the primary node and the backup node for the same partition, there is a chance that you end up losing the entire partition, which is bad — again, bad if you don't have a backup, and restoring from a backup takes time, so it's also inconvenient. Now, in the Roxie case, you have far larger pressure to have the process continue, because your Roxie system is typically exposed to online production customers that may pay you a lot of money for you to be highly available.
0:41:06
So Roxie allows you to define the amount of redundancy that you want, based on the number of copies that you want. You could say, well, I have ten nodes in a Roxie and I need — which is the default — one copy of the data, or I need three copies of the data. So the partition on node one will also have a copy on nodes two, three, and four, and so on and so forth. Of course, you need four times the space, but you have far higher resilience if something goes very wrong, and Roxie will continue to work even if a node is down, or two nodes are down, or as many nodes as you want are down, as long as the data is still fine. Because, worst case scenario, even if a partition is lost completely, Roxie might, if you want, continue to run, but it won't be able to answer any queries that were trying to leverage that particular partition that is gone — which is sometimes not a good situation. You asked about the format of the keys. The format of the keys, of the indexes, in Roxie is interesting. For the most part you will have a primary key, and these are keys that are multi-field, like in any normal decent database out there. They have multiple fields, and typically those fields are ordered by cardinality, so the fields with the larger cardinality are at the front to make it perform better. It has interesting abilities — like, for example, you can step over a field that you don't have a value for, using a wildcard, and still use the remaining fields, which is not something a database normally does: there, once you hit a field that you don't have a value to apply, the rest of the fields to the right are useless. And there are other things that are quite interesting there. The way the data is stored in those keys is by decomposing the keys into two components: there is a top-level component that indicates which node has that partition, and there is a bottom-level component that indicates where on the hard drive of that node the specific data elements, or the specific block of data elements, are. By decomposing the keys into these two hierarchical levels, every node in Roxie can hold the top level, which is very small, so every node knows where to go for the specific values. Every node can be queried from the front end, so you have good scalability on the front end — you can have a load balancer and load balance across all of the nodes — and on the back end they know which node to ask for the data. When I said that the top level points to the specific node, I lied a little bit, because it's not node number one; it uses multicast. Nodes, when they have a partition of the data, subscribe to a multicast channel, and what you have in the top level is the number of the multicast channel that handles that partition. That allows us to make Roxie nodes more dynamic and also handle the fault tolerance situations where nodes go down. It doesn't matter if you send the message to a multicast channel — any node that is up will get the message. Which one will respond? Well, it will be the faster node, the node that is less burdened by other queries, for example. And if any node dies in the channel, it really doesn't matter.
You're not stuck in a TCP connection waiting for the handshake to happen, because the way it works is UDP: you send the message, and you will get the response. And of course, if nobody responds in a reasonable amount of time, you can resend that message as well.
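A hedged sketch of the multi-field key behavior described above, using a standalone index declaration of the kind a Roxie query might use (field names, sizes, and the file path are assumptions): the leading field is keyed, the middle field is stepped over with a wildcard, and the trailing field is still applied as a keyed filter.

```
// Key fields ordered (roughly) by descending cardinality, payload last.
PhoneIdx := INDEX({STRING20 LastName, STRING20 FirstName, STRING2 State},
                  {STRING12 Phone},
                  '~example::phonebook::byname3');

// WILD steps over FirstName while State is still used as a keyed filter --
// the "skip a field and keep using the rest" behavior Flavio mentions.
results := PhoneIdx(KEYED(LastName = 'SMITH'),
                    WILD(FirstName),
                    KEYED(State = 'GA'));
OUTPUT(results);
```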
0:44:53
Going back to the architecture of the system, and the fact of how long it's been in development and use, and the massive changes that have occurred at the industry level as far as how we approach data management, the uses of data, and the overall systems that we might need to integrate with — I'm curious how the HPCC platform itself has evolved accordingly, and some of the integration points that are available for being able to reach out to or consume from some of these other systems that organizations might be using.
0:45:26
We have changed quite a bit. Even though the HPCC Systems name and some of the code base resemble what we had 20 years ago, as you can imagine, any piece of software is a living entity that changes and evolves, as long as the community behind it is active, right? So we have changed significantly: we have not just added core functionality to HPCC, or changed functionality we had in order to adapt to the times, but also built integration points. I mentioned Spark, for example. Even though HPCC is very similar to Spark, Spark has a large community around machine learning, so it is useful to integrate with Spark: many times people may be using Spark ML, but they may want to use HPCC for data management, and having a proper integration where you can run Spark ML on top of HPCC is something that can be attractive to a significant part of the HPCC open source community. In other cases, like Hadoop and HDFS access, it's the same, as are integrations with other programming languages. Many times people don't feel comfortable programming everything in ECL. ECL works very well for data management, for anything that is a data management centric process, but sometimes you have little components in the process that cannot be easily expressed in ECL, at least not in a way that is efficient.
0:46:55
I don't know — I'll just throw out one little example: generating unique IDs for things, where you want to generate those unique IDs in a random manner, like UUIDs.
0:47:06
Surely you could code this in ECL — you could come up with some crafty way of doing it in ECL — but it would make absolutely no sense to code it in ECL, to then be compiled into some big chunk of C++, when I can code it directly in C or C++, or Python, or Java, or JavaScript. So being able to embed all of these languages into ECL became quite important, and we built quite a bit of integration for embedded languages. A few major versions ago — a few years ago — we added support for the languages I already mentioned: Python, Java, JavaScript; and of course C and C++ were available before that. So people can add these little snippets of functionality, create attributes that are just embedded-language attributes, and those are exposed in ECL as if they were ECL primitives. So now they have the ability to expand the capability of the core language to support new things without the need to write them natively in ECL every time. There are plenty of other enhancements as well on the front-end side. I mentioned ESP — ESP is this front-end access layer; think of it as a sort of message bus in front of your Roxie system. In the past, we used to require that you code your ECL query for Roxie and then also write an ESP component coded in C++ — you needed to go to ESP and extend it with a dynamic module to support the front-end interface for that query — which is twice the work, and it requires someone who also knows C++, not just someone who knows ECL. So we changed that, and we now use something called dynamic ESDL, which auto-generates, as I mentioned before, these interfaces from ESP. As you code the ECL, all you do is declare what you expect as input — you write the query with a parameterized interface — and then automatically ESP takes those parameters and exposes them in this front-end interface for users to consume. We have also done quite a bit of integration with systems that can help with benchmarking of HPCC, availability monitoring, performance monitoring, and capacity planning of HPCC as well. We try to integrate as much as we can with other components in the open source community. We truly love open source projects, so if there is a project that has already done something we can leverage, we try to stay away from reinventing the wheel every time and we use it. If it's not open source — if it's commercial — we do have a number of integrations with commercial systems as well; we are not religious about it, but certainly it's a little bit less enticing to put the effort into something that is closed source. And again, we believe the open source model is a very good model, because it gives you the ability to know how things are done under the hood, and to extend them and fix them if you need to. We do this all the time with other projects, and we believe it has a significant amount of value for anyone out there.
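As a sketch of the embedded-language support described here (assuming the Python embed plugin is installed; in newer releases the module is imported as Python3 instead), the UUID example might look like this in ECL:

```
IMPORT Python;

// The body is plain Python; ECL sees MakeUUID() as if it were a native primitive.
STRING36 MakeUUID() := EMBED(Python)
    import uuid
    return str(uuid.uuid4())
ENDEMBED;

OUTPUT(MakeUUID());
```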
0:50:26
On the subject of the open source nature of the project, I know that it was released as open source in, I think you said, the 2011 timeframe, which postdates when Hadoop had become popular and started to accrue its own ecosystem. I'm curious what your thoughts are on the relative strength of the communities for Hadoop and HPCC, currently, given that there seems to be a bit of a decline in Hadoop itself as far as the amount of utility that organizations are getting from it. I'm also interested in the governance strategy that you have for the HPCC platform and some of the ways that you approach sustainability of the project.
0:51:08
So you're absolutely right: the Hadoop community has apparently at least reached a plateau, and it is certainly larger than the HPCC Systems community in number of people. It was the first one out in the open. We had HPCC for a very long time as closed source — it was proprietary — and at the time we believed it was so core to our competitive advantage that we couldn't afford to release it any other way. Then we realized that, in reality, the core advantage we have is, on one side, the data assets, and on the other side the high level algorithms. We knew that the platform would be better sustained in the long run — and sustainability is an important factor for us, because the platform is so core to everything we do — by making it open source and free, completely free, both free as in speech and free as in beer. We thought that would be the way to ensure long term sustainability, development, expansion, and innovation in the platform itself. But when we did that, it was 2011, a few years after Hadoop. Hadoop, if you remember, started as part of another project around web crawling and content management, and eventually ended up as its own top-level Apache project in 2008, I believe. So Hadoop had already been out there for three or four and a half years, and its community was already really large. Over time, we did gather a fairly active community, and today we have an active, deeply technical community that not only helps with extending and expanding HPCC but also provides us use cases — sometimes very interesting use cases — and uses HPCC regularly. So the HPCC Systems community continues to grow, while the Hadoop community seems to have reached a plateau. Now, there are other communities out there which also handle some of the data management aspects with their own platforms, like Spark, which I mentioned before and which seems to have a better performance profile than what Hadoop has, so it has also been gathering active people. Well, I think open source is not a zero sum game, where if one community grows another must shrink and the total number of people across all of them stays the same. I think every new platform that introduces capabilities to open source communities, brings new ideas, and helps apply innovation to those ideas is helping the overall community in general. So it's great to see communities like the Spark community growing, and I think there's an opportunity — and many users in both communities are using both — for all of them to leverage what is in the others. Surely, sometimes the specific language used to code the platforms creates a little bit of a barrier: some of these communities, just because Java is potentially more common, use Java instead of C++ and C, so you see that sometimes the people in one community, who may be more versed in Java, feel uncomfortable going and trying to understand the code of the other platform that is written in a different language.
0:54:52
But even then, generally, the differences in functions and capabilities can be extracted and adopted across platforms, and I think this is good for the overall benefit of everyone. I see open source, in many cases, as an experimentation playground where people can bring new ideas, apply those ideas to some code, and then everyone else eventually leverages them, because those ideas percolate across different projects. It's quite interesting. Having been personally involved in open source since the early 90s, I'm quite fond of the process by which open source works. I think it's beneficial to everyone in every community.
0:55:37
And in terms of the way that you're taking advantage of the HPCC platform at LexisNexis, and some of the ways that you have seen it used elsewhere, I'm wondering what are some of the notable capabilities that you're leveraging and some of the interesting ways that you've seen other people take advantage of it?
0:55:54
That's a good question, and my answer might take a little bit longer. At LexisNexis in particular, we certainly use HPCC for almost everything we do, because almost everything we do is data management or data quality in some way. Now, we have interesting approaches to some things. We have a number of processes that run on data, and one of those is the probabilistic linkage process. Probabilistic linkage sometimes requires quite a bit of code to make it work correctly, so there was a point where writing it in ECL was creating a code base that was getting more and more sizable, larger, less manageable. At some point we decided that the level of abstraction of ECL, which is pretty high anyway, wasn't enough for probabilistic data linkage. So we created another language, which we call SALT, and while the underlying platform is open source, that language is still proprietary. You can consider SALT a domain specific language for probabilistic record linkage and data integration. There is a compiler for SALT that compiles SALT into ECL, the ECL compiler compiles that into C++, and Clang or GCC compiles the C++ into assembler. So you can see how the abstraction layers stack up like the layers of an onion, and every time an optimization is applied in the SALT compiler, or sometimes the GCC compiler team applies an optimization, everyone on top of that layer benefits from it, which is quite interesting. We liked that approach so much that eventually we applied it to another problem, which is dealing with graphs, and when I say graphs I mean social graphs rather than charts. We built yet another language that deals with graphs and machine learning, particularly machine learning on graphs, called KEL, the Knowledge Engineering Language. We don't have an open source version of that one, but we do have a version of the compiler out there for people who want to try it. KEL also generates ECL, and the ECL in turn becomes C++, so it comes back to the same point. This is an interesting approach to building abstraction by creating DSLs, domain specific languages, on top of ECL. Another interesting application of HPCC, outside of LexisNexis, is a company called Guardhat. They make hard hats that are smart: they can do geofencing for workers and detect risky environments in manufacturing or construction settings. They use HPCC, along with some of the real time integrations that we have with things like Kafka and CouchDB, and the other integrations I mentioned when I talked about our work on integrating HPCC with other open source projects, to manage all of this data, which is fairly real time, and to create real time analytics, real time machine learning execution for the models they have, integration of data, and even visualization on top of it. And there are more; I could go on for days giving you ideas of things that we, or others in the community, have done using HPCC.
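As a rough, generic illustration of what probabilistic record linkage is doing under the hood, here is a minimal Fellegi-Sunter style scorer in plain Python. The field names and weights are invented for the example; this is not SALT, ECL, or the actual LexisNexis linkage logic, just the shape of the problem that SALT abstracts away.

```python
# Minimal, illustrative probabilistic record linkage scorer.
# Generic Fellegi-Sunter-style sketch; field names and probabilities are invented.
from math import log2

# Hypothetical per-field match (m) and coincidental-agreement (u) probabilities.
FIELD_WEIGHTS = {
    "last_name":  {"m": 0.95, "u": 0.01},
    "first_name": {"m": 0.90, "u": 0.05},
    "dob":        {"m": 0.98, "u": 0.001},
    "zip":        {"m": 0.90, "u": 0.10},
}

def field_weight(field: str, agree: bool) -> float:
    """Log-likelihood contribution of one field agreeing or disagreeing."""
    m, u = FIELD_WEIGHTS[field]["m"], FIELD_WEIGHTS[field]["u"]
    return log2(m / u) if agree else log2((1 - m) / (1 - u))

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the per-field weights for a candidate record pair."""
    return sum(
        field_weight(f, rec_a.get(f) == rec_b.get(f))
        for f in FIELD_WEIGHTS
    )

a = {"last_name": "Smith", "first_name": "Ann", "dob": "1980-02-01", "zip": "30308"}
b = {"last_name": "Smith", "first_name": "Anne", "dob": "1980-02-01", "zip": "30308"}
print(match_score(a, b))  # above some tuned threshold -> treat as the same entity
```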
0:59:21
And in terms of the overall experience that you have had working with HPCC, both on the platform side and as a user of it, what have you found to be some of the most challenging aspects, and some of the most useful and interesting lessons that you've learned in the process?
0:59:38
That is a great question, and I'll give you a very simple answer and then explain what I mean. One of the biggest challenges, if you are a new user, is ECL. One of the biggest benefits is also ECL. Unfortunately, not everyone is well versed in declarative programming models, so when you are exposed for the first time to a declarative language that has immutability, laziness, and no side effects, it can be a bit of a brain twister. You need to think about problems in a slightly different way to be able to solve them. When you are used to imperative programming, you typically solve a problem by decomposing it into a recipe of things the computer needs to do, step by step, one by one. When you do declarative programming, you decompose the problem into a set of functions that need to be applied, and you build it from the ground up. It is a slightly different approach, but once you get the idea of how it works, it becomes quite powerful for a number of reasons. First of all, you get to understand the problem more, and you can express algorithms in a far more succinct way, as just a collection of attributes, where some attributes depend on other attributes you have defined. It also helps you better encapsulate the components of the problem, so your code, instead of becoming the kind of spaghetti that is hard to troubleshoot, is well encapsulated, both in terms of functionality and in terms of data. If you need to touch anything later on, you can do it safely without worrying about what some function you are calling might be doing behind your back, because you know there are no side effects. And, as long as you name your attributes correctly so people understand what they are supposed to do, it lets you collaborate more easily with other people as well. After a while I realized, and others have had the same experience, that the code I was writing in ECL was, first of all, mostly correct most of the time, which is not what happens with non declarative programming. You know that if the code compiles, there is a high chance that it will run correctly and give you correct results. As I explained before, with a dynamically typed, imperative language with side effects, sure, the code may compile, and maybe it will run fine a few times, but one day it may give you some sort of runtime error because some type is mismatched, or some side effect you didn't consider when you re-architected a piece of the code kicks in and makes your results different from what you expected. So again, ECL has been quite a blessing from that standpoint. But of course, it does require that you learn, and want to learn, this new methodology of programming, which is similar to what someone who knows Python or Java needs to learn in order to apply SQL, another declarative language, the way you write SQL interactively when you are trying to query a database.
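For a flavor of the mental shift being described here, the following is a tiny contrast written in Python rather than ECL; the data and names are invented, and the "declarative" half is only loosely analogous to ECL attributes.

```python
# Illustrative only: the same task in an imperative style and in a
# declarative/functional style closer in spirit to ECL attributes.
records = [{"name": "a", "amount": 10}, {"name": "b", "amount": -3}, {"name": "c", "amount": 7}]

# Imperative: a recipe of steps mutating shared state.
total = 0
for r in records:
    if r["amount"] > 0:
        total += r["amount"]

# Declarative-ish: named, side-effect-free definitions built from other definitions,
# which a compiler or engine is free to reorder, optimize, and parallelize.
positive = [r for r in records if r["amount"] > 0]   # definition 1
positive_total = sum(r["amount"] for r in positive)  # definition 2, built from definition 1

assert total == positive_total == 17
```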
1:03:34
Looking forward, in terms of the medium to long term, as well as the near term, for the HPCC platform itself, what do you have planned for the future, both in terms of technical capabilities and features, and also as far as community growth and outreach that you'd like to see?
1:03:53
So from the technical capabilities and features side, we maintain a community roadmap, and we try, as much as we can, to stick with that roadmap. There are the big ideas, which tend to go into the next or following major version; the smaller ideas, which are typically non disruptive and don't break backwards compatibility, which go into the minor versions; and then, of course, the bug fixes. Like many people say, they are not bugs but opportunities. On the big ideas side of things, some of what we've been doing is better integration with other open source projects, which I mentioned before is quite important. We've also been changing some of the underlying components of the platform. There are components we have had for a very, very long time, like the underlying communication layer in Roxie, which we think may now be ready for a further revamping by incorporating some of the standard communication layers out there. There is also the idea of making the platform far more cloud friendly. It already runs very well in many public clouds, on OpenStack, Amazon, Google, and Azure, but we want to make the clusters more dynamic. I don't know if you spotted, when I explained how you do data management with Thor, that I said you have a fixed number of nodes. Well, what happens when you want to change that ten node Thor into a 20 or 30 node one, or a five node one? Maybe you have a small process that would work fine with just one or two nodes, and a large process that may need 1,000 nodes. Today you cannot dynamically resize the Thor cluster; sure, you can resize it by hand and then redistribute the data across the new number of nodes, but it is a lot more involved than we would like. In dynamic cloud environments that elasticity becomes quite important, because it's one of the benefits of cloud, so making the clusters more elastic and more dynamic is another big goal. Certainly, we continue to develop machine learning capabilities on top of the platform. We have a library of machine learning algorithms and methods, and we are expanding it. Some of these machine learning methods are, I would say, quite innovative. One of our core developers, who is also a researcher, developed a new distributed algorithm for K-means clustering which she hadn't seen in the literature before. It's part of a paper and her PhD dissertation, which is very good, and it is now also part of HPCC, so people can leverage it. It gives significantly higher scalability to K-means, particularly if you're running on a very large number of nodes. I won't get into all the details of how it achieves this far better performance, but in summary, it distributes the data less and instead distributes the centers more, and it uses the associative property of the main loop of K-means clustering to minimize the number of data records that need to be moved around. That's it from the standpoint of the roadmap and the platform itself.
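The episode doesn't spell out the algorithm, but the general trick of moving centroids instead of records leans on the fact that per-cluster means can be assembled from partial sums computed locally on each node and combined associatively. Here is a rough sketch of that idea in plain Python, with made-up data and none of HPCC's actual implementation details.

```python
# Sketch of the associativity that lets K-means keep data in place:
# each node computes partial (sum, count) pairs per cluster for its local records,
# and only those small partials are combined to update the shared centroids.
# Generic illustration, not the HPCC Systems implementation.

def closest(point, centroids):
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def local_partials(local_points, centroids):
    """Per-node: accumulate sum vectors and counts for each cluster."""
    dims, k = len(centroids[0]), len(centroids)
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for pt in local_points:
        c = closest(pt, centroids)
        counts[c] += 1
        sums[c] = [s + p for s, p in zip(sums[c], pt)]
    return sums, counts

def combine(partials):
    """Associative merge of per-node partials (order doesn't matter)."""
    sums, counts = partials[0]
    for s2, c2 in partials[1:]:
        sums = [[a + b for a, b in zip(sa, sb)] for sa, sb in zip(sums, s2)]
        counts = [a + b for a, b in zip(counts, c2)]
    return [[s / max(c, 1) for s in sv] for sv, c in zip(sums, counts)]  # new centroids

# Two "nodes", each holding its own shard of points; only the tiny partials move.
node_a = [(0.0, 0.0), (0.2, 0.1)]
node_b = [(5.0, 5.0), (5.1, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(combine([local_partials(node_a, centroids), local_partials(node_b, centroids)]))
```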
On the community side, we continue to try to expand the community as much as we can. One of our core interests, and I mentioned this core developer who is also a researcher, is to get more researchers and academia onto the platform. We have a number of collaboration initiatives with universities in the US and abroad: Oxford University in the UK, University College London, Humboldt University in Germany, and a number of universities in the US such as Clemson University, Georgia Tech, and Georgia State University, and we want to expand this program more. We also have an internship program. One of the goals we want to achieve with the HPCC Systems open source project is to better balance the community behind it by improving diversity across the community, and by that I mean gender diversity, regional diversity, and background diversity. So we are also putting quite a bit of emphasis on students, even high school students. We are doing quite a bit of activity with high schools, on one side trying to get them more into technology, and of course to learn HPCC, but also trying to get more women into technology, and more people who otherwise wouldn't get into technology because they don't get exposed to it in their homes. That's another core piece of activity in the HPCC community. Last but not least, as part of this diversity effort, there are certain communities that are a little more disadvantaged than others, and one of those is people on the autism spectrum. We have been doing quite a bit of work with organizations that help these individuals, trying to enable them through a number of activities, some of which involve training them on the HPCC Systems platform and on data management, to open up opportunities for them and for their lives. Many of these individuals are extremely intelligent, they're brilliant; they may have other limitations because of their conditions, but they can be very, very valuable contributors, not just for LexisNexis Risk Solutions, where ideally we could hire them, but for other organizations as well.
1:09:48
It's great to hear that you have all of these outreach opportunities for trying to bring more people into technology, as a means of giving back as well as a means of helping to grow your community and contribute to the overall use cases that it empowers. So for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
1:10:19
I think there are a number of gaps, but the major one is that many of the platforms out there tend to be quite clunky when it comes to integrating things. Unfortunately, I don't think we are mature enough yet. By mature enough, I mean this: if you are a data management person, you know data very well, you know data analytics, you know data processes, but you don't necessarily know operating systems, and you are not necessarily a computer scientist who can deal with data partitioning and the computational complexity of algorithms over partitioned data. There are many details that today are necessary for you to do your job correctly that shouldn't be necessary. Unfortunately, because of the state of things, many of these systems, commercial and non commercial, force you to take care of all of those details, or to assemble a large team of people, from system administrators to network administrators to operating system specialists to middleware specialists, before you can build a system that lets you do your data management. That's something we try to overcome with HPCC, by giving you this homogeneous system that you deploy with a single command and that you can use a minute later, after you've deployed it. I won't say we are in the ideal situation yet, I think there is still much to improve on, but I think we are a little further along than many of the other options out there. If you know the Hadoop ecosystem, you know how many different components are out there, and if you have done this for a while, you know that one day you'll realize there is a security vulnerability in one component and you need to update it, but doing that will break its compatibility with something else, so now you need to update that other thing, but there is no update for it because it depends on yet another component, and this goes on and on. So having something that is homogeneous, that doesn't require you to be a computer scientist to deploy and use, and that truly gives you the abstraction layer that you need, which is data management, is a significant gap in many, many systems out there. And I'm not just pointing this at open source projects; it applies to commercial products as well. I think it's something that some of the people designing and developing these systems might not understand because they are not the users. You need to put yourself in the shoes of the user in order to do the right thing; otherwise, whatever you build is pretty difficult to apply, and sometimes useless.
1:13:03
Well, thank you very much for taking the time today to join me and describe the ways that HPCC is built and architected, as well as some of the ways that it's being used both inside and outside of LexisNexis. I appreciate all of your time and all the information there. It's definitely a very interesting system, and one that looks to provide a lot of value and capability, so I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
1:13:30
Thank you very much. I really enjoyed this and I look forward to doing this again. So one day we'll get together again. Thank you

Digging Into Data Replication At Fivetran - Episode 93

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing George Fraser about Fivetran, a hosted platform for replicating your data from source to destination

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that Fivetran solves and the story of how it got started?
  • Integration of multiple data sources (e.g. entity resolution)
  • How is Fivetran architected and how has the overall system design changed since you first began working on it?
  • monitoring and alerting
  • Automated schema normalization. How does it work for customized data sources?
  • Managing schema drift while avoiding data loss
  • Change data capture
  • What have you found to be the most complex or challenging data sources to work with reliably?
  • Workflow for users getting started with Fivetran
  • When is Fivetran the wrong choice for collecting and analyzing your data?
  • What have you found to be the most challenging aspects of working in the space of data integrations?
  • What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
  • What do you have planned for the future of Fivetran?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing George Fraser about Fivetran, a platform for shipping your data to data warehouses in a managed fashion. So George, can you start by introducing yourself?
0:01:54
Yeah, my name is George. I am the CEO of Fivetran, and I was one of the two co-founders of Fivetran almost seven years ago when we started.
0:02:02
And do you remember how you first got involved in the area of data management?
0:02:05
Well, before Fivetran I was actually a scientist, which is a bit of an unusual background for someone in data management, although it was sort of an advantage for us that we were coming at it fresh. So much has changed in the area of data management, particularly because of the new data warehouses that are so much faster and so much cheaper and so much easier to manage than the previous generation, that a fresh approach is really merited. And so, in a weird way, the fact that none of the founding team had a background in data management was kind of an advantage.
0:02:38
And so can you start by describing a bit about the problem that Fivetran was built to solve, the overall story of how it got started, and what motivated you to build a company around it?
0:02:50
Well, I'll start with the story of how it got started. In late 2012, when we started the company, it was Taylor and I, and then Mel, who's now our VP of engineering, joined early in 2013. Fivetran was originally a vertically integrated data analysis tool. It had a user interface that was sort of a super powered spreadsheet slash BI tool, it had a data warehouse on the inside, and it had a data pipeline that was feeding the data warehouse. Through many iterations of that idea, we discovered that the really valuable thing we had invented was actually the data pipeline that was part of it. So we threw everything else away, and the data pipeline became the product. And the problem that Fivetran solves is the problem of getting all your company's data in one place. Companies today use all kinds of tools to manage their business. You use CRM systems like Salesforce, payment systems like Stripe, support systems like Zendesk, finance systems like QuickBooks or Zuora, you have a production database somewhere, maybe you have 20 production databases. And if you want to know what is happening in your business, the first step is usually to synchronize all of this data into a single database, where an analyst can query it and where you can build dashboards and BI tools on top of it. So that's the primary problem that Fivetran solves. People use Fivetran to do other things, sometimes they use the data warehouse that we're syncing to as a production system, but the most common use case is they're just trying to understand what's going on in their business, and the first step in that is to sync all of that data into a single database.
0:04:38
And in recent years, one of the prevalent approaches for getting all of the data into one location to do analysis across it is to dump it all into a data lake, because you don't need to do as much upfront schema management or data cleaning, and then you can experiment with everything that's available. I'm wondering what your experience has been as far as the contrast between loading everything into a data warehouse for that purpose versus just using a data lake.
0:05:07
Yeah. So in this area, I think that sometimes people present a bit of a false choice: you can either set up a data warehouse, do full-on Kimball dimensional schema data modeling with Informatica, with all of the upsides and downsides that come with that, or you can build a data lake, which is like a bunch of JSON and CSV files in S3. And I say false choice because I think the right approach is a happy medium, where you don't go all the way to sticking raw JSON files and CSV files in S3, that's really unnecessary. Instead, you use a proper relational data store, but you exercise restraint in how much normalization and customization you do on the way in. So you say, I'm going to make my first goal to create an accurate replica of all the systems in one database, and then I'm going to leave that alone; that's going to be my sort of staging area, kind of like my data lake, except it lives in a regular relational data warehouse. And then I'm going to build whatever transformations I want to do on that data on top of that data lake schema. So another way of thinking about it is that I am advising that you should take a data lake type approach, but you shouldn't make your data lake a separate physical system. Instead, your data lake should just be a different logical system within the same database that you're using to analyze all your data and to support your BI tool. It's just a higher productivity, simpler workflow to do it that way.
0:06:47
Yeah, and that's where the current trend of moving the transformation step until after the data loading, into the ELT pattern, has been coming from, because of the flexibility of these cloud data warehouses that you've mentioned, as far as being able to consume semi-structured and unstructured data while still being able to query across it and introspect it for the purposes of joining it with other information that's already within that system.
0:07:11
Yeah, the ELT pattern is really just a great way to get work done. It's simple, and it allows you to recover from mistakes. If you make a mistake in your transformations, and you will make mistakes in your transformations, or even if you just change your mind about how you want to transform the data, the great advantage of the ELT pattern is that the original untransformed data is still sitting there side by side in the same database. So it's just really easy to iterate in a way that it isn't if you're transforming the data on the fly, or even if you have a data lake where you store the API responses from all of your systems; that's still more complicated than if you just have this nice replica sitting in its own schema in your data warehouse.
0:07:58
And so one of the things that you pointed out is needing to be able to integrate across the multiple different data sources that you might be using within a business, and you mentioned things like Salesforce for CRM, or ticket tracking and user feedback tools such as Zendesk, etc. I'm wondering what your experience has been as far as being able to map the logical entities across these different systems together, to be able to effectively join and query across those data sets, given that they don't necessarily have a shared source of truth for things like how customers are represented, or even what the common field names might be, to be able to map across those different entities.
0:08:42
Yeah, this is a really important step, and the first thing we always advise our customers to do, and really anyone who's building a data warehouse, is to keep straight in your mind that there are really two problems here. The first problem is replicating all of the data, and the second problem is rationalizing all the data into a single schema. And you need to think of these as two steps; you need to follow proper separation of concerns, just as you would in a software engineering project. We really focus on that first step, on replication. What we have found is that the approach that works really well for our customers for rationalizing all the data into a single schema is to use SQL. SQL is a great tool for unioning things, joining things, changing field names, filtering data, all the kinds of stuff you need to do to rationalize a bunch of different data sources into a single schema, and we find the most productive way to do that is with a bunch of SQL queries that run inside your data warehouse.
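As a toy example of what that rationalization step often looks like, here is a hedged sketch that runs generic SQL from Python against an in-memory SQLite database. The table and column names are invented, and in practice this kind of SQL would run inside your warehouse, not SQLite.

```python
# Toy example of rationalizing two replicated source schemas into one
# analytics view with SQL. Names are invented for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pretend these were loaded, untouched, by the replication step.
    CREATE TABLE salesforce_account (id TEXT, name TEXT, billing_country TEXT);
    CREATE TABLE stripe_customer    (id TEXT, description TEXT, country TEXT);
    INSERT INTO salesforce_account VALUES ('sf1', 'Acme Corp', 'US');
    INSERT INTO stripe_customer    VALUES ('cus_1', 'Acme Corp', 'US');

    -- The "rationalized" layer: rename and union into one consistent shape.
    CREATE VIEW dim_customer AS
        SELECT id              AS source_id,
               name            AS customer_name,
               billing_country AS country,
               'salesforce'    AS source
        FROM salesforce_account
        UNION ALL
        SELECT id, description, country, 'stripe'
        FROM stripe_customer;
""")
for row in conn.execute("SELECT * FROM dim_customer"):
    print(row)
```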
0:09:44
And do you have your own tooling and interfaces for exposing that process to your end users, or do you also integrate with tools such as dbt for having that overall process controlled by the end user?
0:10:00
So we originally did not do anything in this area other than give advice, and that gave us the advantage that we got to watch what our users did in that context. What we saw is that a lot of them set up cron to run SQL scripts on a regular schedule. A lot of them used Looker persistent derived tables. Some people used Airflow, and they used Airflow in kind of a funny way: they didn't really use the Python parts of Airflow, they just used Airflow as a way to trigger SQL. And when dbt came out, we got a decent community of users who use dbt, and we're supportive of whatever mechanism you want to use to transform your data. We do now have our own transformation tool built into our UI, and the first version is available now. It's basically a way that you can provide a SQL script and have that script triggered when Fivetran delivers new data to your tables. We've got lots of people using that first version, and it's going to continue to evolve over the rest of this year; it's going to get a lot more sophistication, and it's going to do a lot more to give you insight into the transforms that are running and how they all relate to each other. But the core idea of it is that SQL is the right tool for transforming data.
0:11:19
And before we get too far into the rest of the feature set and capabilities of Fivetran, I'm wondering if you can talk about how the overall system is architected and how the overall system design has evolved since you first began working on it.
0:11:33
Yeah, so the overall architecture is fairly simple. The hard part of Fivetran is really not the sort of high class data problems, things like queues and streams and giant data sets flying around. The hard part of Fivetran is really all of the incidental complexity of all of these data sources, understanding all the small, sort of crazy rules that every API has. So most of our effort over the years has actually been devoted to hunting down all these little details of every single data source we support, and that's what makes our product really valuable. The architecture itself is fairly simple. The original architecture was essentially a bunch of EC2 instances with cron, running a bunch of Java processes on a fast batch cycle, syncing people's data. Over the last year and a half, the engineering team has built a new architecture based on Kubernetes. There are many advantages of this new architecture for us internally, and the biggest one is that it auto-scales. But from the outside, you can't even tell when you migrate from the old architecture to the new one, other than that you have to whitelist a new set of IPs. So it was a very simple architecture in the beginning, and it's gotten somewhat more complex, but really the hard part of Fivetran is not the high class data engineering problems, it's the little details of every data source, so that from the user's perspective you just get this magical replica of all of your systems in a single database.
0:13:16
And for being able to keep track of the overall health of your system and ensure that data is flowing from end to end for all of your different customers, I'm curious what you're using for your monitoring and alerting strategy, any sort of ongoing continuous testing, as well as the unit testing that you're using to make sure that all of your API interactions are consistent with what is necessary for the source systems that you're working with.
0:13:42
Yeah, well, first of all, there are several layers to that. The first one is the testing that we do on our end to validate that all of our sync strategies, all those little details I mentioned a minute ago, are actually working correctly. Our testing problem is quite difficult, because we interoperate with so many external systems, and in many cases you really have to run the tests against the real system for the test to be meaningful. So our build architecture is actually one of the more complex parts of Fivetran. We use a build tool called Bazel, and we've done a lot of work, for example, to run all of the databases and FTP servers and things like that, that we have to interact with, in Docker containers so that we can produce reproducible end-to-end tests. That is one of the more complex engineering problems at Fivetran, and if it sounds interesting to you, I encourage you to apply to our engineering team, because we have lots more work to do on it. So that's the first layer, all of the tests that we run to verify that our sync strategies are correct. The second layer is, is it working in production, is the customer's data actually getting synced correctly? One of the things we do there that may be a little unexpected to people who are accustomed to building data pipelines themselves is that all Fivetran data pipelines are typically fail-fast. That means if anything unexpected happens, if we see, you know, some event from an API endpoint that we don't recognize, we stop. Now, that's different than when you build data pipelines yourself. When you build data pipelines for your own company, usually you will have them try to keep going no matter what. But Fivetran is a fully managed service, and we're monitoring it all the time, so we tend to make the opposite choice: if anything suspicious is going on, the correct thing to do is just stop and alert Fivetran, hey, go check out this customer's data pipeline, something unexpected is happening, and we should make sure that our sync strategies are actually correct. And then that brings us to the last layer of this, which is alerting. When data pipelines fail, we get alerted and the customer gets alerted at the same time, and then we communicate with the customer and say, hey, we may need to go in and check something, do I have permission to go look at what's going on in your data pipeline in order to figure out what's going wrong? Because Fivetran is a fully managed service, and that is critical to making it work. When you do what we do, and you say we are going to take responsibility for actually creating an accurate replica of all of your systems in your data warehouse, that means you're signing on to comprehend and fix every little detail of every data source that you support. A lot of those little details only come up in production, when some customer shows up and they're using a feature of Salesforce that Salesforce hasn't sold for five years, but they've still got it, and you've never seen it before. A lot of those little things only come up in production. The nice thing is that that set of little things, while it is very large, is finite, and we only have to discover each problem once, and then every customer thereafter benefits from that.
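A minimal sketch of the fail-fast idea described above, with an invented event shape and error type rather than Fivetran's actual code: surprising input stops the pipeline so an operator can check whether the sync strategy is still correct, instead of being silently skipped.

```python
# Illustrative fail-fast handling of an API change feed; event fields are invented.
KNOWN_EVENT_TYPES = {"created", "updated", "deleted"}

class UnexpectedSourceData(Exception):
    """Raised (and alerted on) instead of silently skipping surprising input."""

def apply_event(state: dict, event: dict) -> None:
    kind = event.get("type")
    if kind not in KNOWN_EVENT_TYPES:
        # Fail fast: stop the pipeline and page someone, rather than guess.
        raise UnexpectedSourceData(f"unrecognized event type: {kind!r}")
    if kind == "deleted":
        state.pop(event["id"], None)
    else:
        state[event["id"]] = event["data"]
```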
0:17:00
For the system itself, one of the things that I noticed while I was reading through the documentation and the feature set is that for all of these different source systems you provide automated schema normalization. I'm curious how that works and what the overall logic flow is that you have built in: is it just a static mapping that you have for each different data source, or is there some more complex algorithm going on behind the scenes? And how does that work for any sort of customized data sources, such as application databases that you're working with, or maybe just JSON feeds or event streams?
0:17:38
Sure. So the first thing you have to understand is that there are really two categories of data sources in terms of schema normalization. The first category is databases, like Oracle or MySQL or Postgres, and database-like systems; NetSuite is really basically a database when you look at the API, and so is Salesforce, there's a bunch of systems that basically look like databases, with arbitrary tables and columns where you can set any types you want in any column. What we do with those systems is just create an exact one-to-one replica of the source schema. It's really as simple as that. There's a lot of work to do, because the change feeds from those systems are usually very complicated, and it's very complex to turn those change feeds back into the original schema, but it is automated. So for databases and database-like systems, we just produce the exact same schema in your data warehouse as it was in the source. For apps, things like Stripe or Zendesk or GitHub or Jira, we do a lot of normalization of the data. With tools like that, when you look at the API responses, they are very complex and nested, and usually very far from the original normalized schema that this data probably lived in, in the source database. Every time we add a new data source of that type, we study the data source, and, I joke that we reverse engineer the API, we basically figure out what the schema in the database originally was, and we unwind all the API responses back into that normalized schema. These days we often just get an engineer at the company that is that data source on the phone and ask them, you know, what is the real schema here; we found that we can save ourselves a whole lot of work by doing that. But the goal is always to produce a normalized schema in the data warehouse, and the reason why we do that is because we just think, if we put in that work up front to normalize the data in your data warehouse, we can save every single one of our customers a whole bunch of time traipsing through the data trying to figure out how to normalize it. So we figure it's worthwhile for us to put the effort in up front so our customers don't have to.
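To illustrate the idea of unwinding an API response back into a normalized schema, here is a generic sketch with an invented payload; it is not the normalization Fivetran applies to any particular source.

```python
# Invented nested API payload, flattened into two normalized "tables"
# (parent invoice rows plus child line-item rows carrying a foreign key).
api_response = {
    "id": "inv_1",
    "customer": {"id": "cus_1", "email": "a@example.com"},
    "lines": [
        {"id": "li_1", "amount": 500},
        {"id": "li_2", "amount": 250},
    ],
}

def normalize_invoice(payload: dict):
    invoice_row = {
        "id": payload["id"],
        "customer_id": payload["customer"]["id"],   # nested object -> foreign key
    }
    line_rows = [
        {"id": line["id"], "invoice_id": payload["id"], "amount": line["amount"]}
        for line in payload["lines"]                 # nested array -> child table
    ]
    return invoice_row, line_rows

print(normalize_invoice(api_response))
```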
0:20:00
One of the other issues that comes up with normalization, and particularly for the source database systems that you're talking about, is the idea of schema drift, when fields are added or removed, data types change, or the default data types change. I'm wondering how you manage schema drift overall in the data warehouse systems that you're loading into, while preventing data loss, particularly in the cases where a column might be dropped or a data type changed.
0:20:29
Yeah, so there's a core pipeline that all Fivetran connectors, databases, apps, everything, are written against, that we use internally, and all of the rules of how to deal with schema drift are encoded there. Some cases are easy. If you drop a column, then that data just isn't arriving anymore; we will leave that column in your data warehouse, we're not going to delete it in case there's something important in it, and you can drop it in your data warehouse if you want to. If you add a column, again, that's pretty easy: we add a column in your data warehouse, all of the old rows will have nulls in that column, obviously, but going forward we will populate it. The tricky cases are when you change types. When you alter the type of an existing column, that can be more difficult to deal with, and there are two principles we follow. First, we're going to propagate that type change to your data warehouse, so we're going to go and change the type of the column in your data warehouse to fit the new data. And second, when you change types you sometimes sort of contradict yourself, and we follow the rules of subtyping in handling that. If you think back to your undergraduate computer science classes, this is the good old concept of subtypes: for example, an int is a subtype of a real, and a real is a subtype of a string, etc. So we look at all the data passing through the system, we infer the most specific type that can contain all of the values that we have seen, and then we alter the column in the data warehouse to be that type, so that we can actually fit the data into the data warehouse.
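A small sketch of the subtyping rule described here, using an illustrative int to real to string lattice; the parsing rules are simplified and this is not Fivetran's exact implementation.

```python
# Infer the most specific column type that fits every observed value,
# following the int -> real -> string widening order described above.
# The lattice and parsing rules here are illustrative only.
TYPE_ORDER = ["int", "real", "string"]  # each type is a subtype of the next

def value_type(v) -> str:
    s = str(v)
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "real"
    except ValueError:
        return "string"

def infer_column_type(values) -> str:
    widest = "int"
    for v in values:
        t = value_type(v)
        if TYPE_ORDER.index(t) > TYPE_ORDER.index(widest):
            widest = t
    return widest

print(infer_column_type([1, 2, 3]))          # int
print(infer_column_type([1, 2.5, 3]))        # real (the column would be widened to real)
print(infer_column_type([1, 2.5, "three"]))  # string
```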
0:22:17
Another capability that you provide is Change Data Capture for when you're loading from these relational database systems into the data warehouse. And that's a problem space that I've always been interested in as far as how you're able to capture the change logs within the data system, and then be able to replay them effectively to reconstruct the current state of the database without just doing a straight SQL dump. And I'm wondering how you handle that in your platform?
0:22:46
Yeah, it's very complicated. Most people who build in-house data pipelines, as you say, just do a dump and load of the entire table, because the change logs are so complicated. The problem with dump and load is that it requires huge bandwidth, which isn't always available, and it takes a long time, so you end up running it just once an hour if you're lucky, but for a lot of people once a day. So we do change data capture: we read the change logs of each database. Each database has a different change log format, and most of them are extremely complicated. If you look at the MySQL change log format, or the Oracle change log format, it is like going back in time through the history of MySQL; you can sort of see every architectural change in MySQL in the change log format. The answer to how we do it is that there's no trick, it's just a lot of work understanding all the possible corner cases of these change logs. It helps that we have many customers with each database. Unlike when you're building a system just for yourself, because we're building a product we have lots of MySQL users and lots of Postgres users, so over time we see all the little corner cases, and you eventually figure it out, you eventually find all the things, and you get a system that just works. But the short answer is there's really no trick, it's just a huge amount of effort by the databases team at Fivetran, who at this point have been working on it for years with, you know, hundreds of customers. So at this point we've invested so much effort in tracking down all those little things that there's just no hope you could do better yourself, building a change log reader just for your own company.
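At a very high level, change data capture boils down to replaying an ordered log of row-level events on top of a previous snapshot. Here is a toy sketch with a generic, invented event format; real formats like the MySQL binlog or Oracle redo logs are far more involved, which is exactly the point being made above.

```python
# Toy change-log replay: reconstruct current table state from a snapshot
# plus ordered insert/update/delete events.
def replay(snapshot: dict, change_log: list) -> dict:
    state = dict(snapshot)              # {primary_key: row}
    for event in change_log:            # events must be applied in log order
        op, pk = event["op"], event["pk"]
        if op in ("insert", "update"):
            state[pk] = event["row"]
        elif op == "delete":
            state.pop(pk, None)
    return state

snapshot = {1: {"id": 1, "name": "old"}}
log = [
    {"op": "update", "pk": 1, "row": {"id": 1, "name": "new"}},
    {"op": "insert", "pk": 2, "row": {"id": 2, "name": "added"}},
    {"op": "delete", "pk": 1},
]
print(replay(snapshot, log))  # {2: {'id': 2, 'name': 'added'}}
```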
0:24:28
For the particular problem space that you're in, you have a sort of many-to-many issue, where you're dealing with a lot of different types of data sources and then loading them into a number of different options for data warehouses. On the source side, I'm wondering what you have found to be some of the most complex or challenging sources to work with reliably, and some of the strategies that you have found to be effective for picking up a new source and getting it production ready in the shortest amount of time.
0:24:57
Yeah, it's funny, you know, if you ask any engineer at Fivetran, they can all tell you what the most difficult data sources are, because we've had to do so much work on them over the years. Undoubtedly, the most difficult data source is Marketo; close seconds are Jira and Asana, and then probably NetSuite. Those APIs just have a ton of incidental complexity, and it's really hard to get data out of them fast. We're working with some of these sources to try to help them improve their APIs to make it easier to do replication, but there's a handful of data sources that have required disproportionate work to get working reliably. In general, one funny observation we have made over the years is that the companies with the best APIs tend, unfortunately, to be the least successful companies. It seems to be a general principle that companies which have really beautiful, well organized APIs tend not to be very successful businesses, I guess because they're just not focused enough on sales or something. We've seen it time and again, where we integrate a new data source, and we look at the API and we go, man, this API is great, I wish you had more customers so that we could sync for them. The one exception, I would say, is Stripe, which has a great API and is a highly successful company, and that's probably because their API is their product. So there's definitely a spectrum of difficulty. In general, the oldest, largest companies have the most complex APIs.
0:26:32
I wonder if there's some reverse incentive where they make their APIs obtuse and difficult to work with so that they can build up an ecosystem of contractors around them whose sole purpose is to integrate them with other systems.
0:26:46
You know, I think there's a little bit of that, but less than you would think. For example, the company that has by far the most extensive ecosystem of contractors helping people integrate their tool with other systems is Salesforce, and Salesforce's API is quite good. Salesforce is actually one of the simpler APIs out there. It was harder a few years ago when we first implemented it, but they made a lot of improvements, and it's actually one of the better APIs now.
0:27:15
Yeah, I think that's probably coming off the tail of their acquisition of MuleSoft to sort of reformat their internal systems and data representation to make it easier to integrate. Because I know beforehand, it was just a whole mess of XML.
0:27:27
You know, it was really before the MuleSoft acquisition that a lot of the improvements in the Salesforce API happened. The Salesforce REST API was pretty well structured and rational. Five years ago it would fail a lot, you would send queries and they would just not return when you had really big data sets, and now it's more performant. So I think it predates the MuleSoft acquisition; they just did the hard work to make all the corner cases work reliably and scale to large data sets, and Salesforce is now one of the easier data sources. Actually, there are certain objects that have complicated rules, and I think the developers at Fivetran who work on Salesforce will get mad at me when they hear me say this, but compared to, like, NetSuite, it's pretty great.
0:28:12
On the other side of the equation, where you're loading data into the different target data warehouses, I'm wondering what your strategy is as far as making the most effective use of the feature sets that are present, or do you just target the lowest common denominator of SQL representation for loading data in and then leave the complicated aspects to the end user for doing the transformations and analyses?
0:28:36
So most of the code for doing the load side is shared between the data warehouses; the differences between destinations are not that great, except for BigQuery. BigQuery is a bit of an unusual creature, so if you look at Fivetran's code base, there's actually a separate implementation for BigQuery that shares very little with all of the other destinations. But the differences between destinations are not that big of a problem for us. There are certain functions that have to be overridden for different destinations, for things like the names of types, and there are some special cases around performance where our load strategies are slightly different, for example between Snowflake and Redshift, just to get faster performance. But in general, the destinations are actually the easier side of the business. And then in terms of transformations, it's really up to the user to write the SQL that transforms their data, and it is true that to write effective transformations, especially incremental transformations, you always have to use the proprietary features of the particular database that you're working on.
0:29:46
On the incremental piece, I'm interested in how you address that for some of the different source systems, because for the databases where you're doing change data capture it's fairly obvious that you can take that approach for data loading. But for some of the more API oriented systems, I'm wondering if there's a high degree of variability in being able to pull in just the objects that have changed since the last sync time, or if there are a number of systems that will just give you absolutely everything every time, and then you have to work out the changes on your side.
0:30:20
The complexity of those change feeds, I know I mentioned this earlier, but it is staggering. But yes, on the API side we're also doing change data capture of apps. It is different for every app, but just about every API we work with provides some kind of change feed mechanism. Now, it is complicated; you often end up in a situation where the API will give you a change feed that's incremental, but then other endpoints are not incremental, so you have to do this thing where you read the change feed, look at the individual events in the change feed, and then go look up the related information from the other entity. So you end up dealing with a bunch of extra complexity because of that. But as with all things at Fivetran, we have this advantage that we have many customers with each data source, so we can put in that disproportionate effort that you would never put in if you were building it just for yourself, to make the change capture mechanism work properly, because we just have to do it once and then everyone who uses that data source can benefit from it.
0:31:23
For people who are getting onboarded onto the Fivetran system, I'm curious what the overall workflow looks like as far as the initial setup, and then what their workflow looks like as they're adding new sources or just interacting with their Fivetran account to keep track of the overall health of their system, or if it's largely just fire and forget and they're only interacting with the data warehouse on the other side.
0:31:47
It's pretty simple. The joke at Fivetran is that our demo takes about 15 seconds. Because we're so committed to automation, and we're so committed to this idea that Fivetran's fundamental job is to replicate everything into your data warehouse and then you can do whatever you want with it, there's very little UI. The process of setting up a new data source is basically connect source, which for many sources is as simple as going through an OAuth redirect, where you just click, yes, Fivetran is allowed to access my data, and that's it, and connect destination. Now that we're integrated with Snowflake and BigQuery, you can just push a button in Snowflake or in BigQuery and create a Fivetran account that's pre-connected to your data warehouse. So the setup process is really simple. After setup, there's a bunch of UI around monitoring what's happening. We like to say that Fivetran is a glass box; it was originally a black box, and now it's a glass box. You can see exactly what it's doing, you can't change it, but you can see exactly what we're doing at all times. Part of that is in the UI, and part of that is in the emails you get when things go wrong or when the sync finishes for the first time, that kind of thing.
0:33:00
As part of that visibility, I also noticed that you will ship the transaction logs to the end user's log aggregation system, and I thought that was an interesting approach as far as giving them a way to access all of that information in one place, without having to go to your platform just for that one-off case of trying to see what the transaction logs are and gain that extra piece of visibility. So I'm wondering what types of feedback you've gotten from users as far as the overall visibility into your systems and the ways that they're able to integrate it into their monitoring platforms.
0:33:34
Yeah, so the logs we're talking about are the logs of every action Fivetran took: Fivetran made this API call against Salesforce, Fivetran ran this log miner query against Oracle. We record all this metadata about everything we're doing, and you can see it in the UI, but you can also ship it to your own logging system like CloudWatch or Stackdriver, because a lot of companies, in the same way they have a centralized data warehouse, have a centralized logging system. It's mostly used by larger companies; those are the ones who invest the effort in setting up those centralized logging systems. It's actually the system we built first, before we built it into our own UI, and later we found it's also important to have it in our own UI as a quick way to view what's going on. And, yeah, I think people have appreciated that we're happy to support the systems they already have, rather than try to build our own thing and force you to use that.
0:34:34
I imagine that also plays into efforts within these organizations to track data lineage and provenance and to understand the overall lifecycle of their data as it spans different systems.
0:34:47
You know, that's not so much a logging problem; that's more of a metadata problem inside the data warehouse. When you're trying to track lineage, to say this row in my data warehouse came from this transformation, which came from these three tables, and these tables came from Salesforce, and it was connected by this user, and it synced at this time, etc., that lineage problem is really more of a metadata problem. And that's kind of a greenfield in our area right now. There are a couple of different companies that are trying to solve that problem. We're doing some interesting work on that in conjunction with our transformations. I think it's a very important problem. There's still a lot of work to be done there.
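To make the "lineage is a metadata problem" framing concrete, the kind of record such a system ends up tracking per warehouse table might look roughly like this (a hypothetical shape, not an existing Fivetran or warehouse API):

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class LineageRecord:
        """Metadata describing where a warehouse table's contents came from."""
        table: str                # e.g. "analytics.orders"
        derived_from: list        # upstream tables or transformations
        source_system: str        # e.g. "salesforce"
        connected_by: str         # user who set up the connector
        last_synced_at: datetime = field(default_factory=datetime.utcnow)

    record = LineageRecord(
        table="analytics.orders",
        derived_from=["raw_salesforce.opportunity", "raw_stripe.charge"],
        source_system="salesforce",
        connected_by="data-eng@example.com",
    )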
0:35:28
So on the sales side of things too, I know you said that your demo is about 15 seconds, as far as yes, you just do this and this, and then your data is in your data warehouse. But I'm wondering what you have found to be some of the common questions or common issues that bring people to you as they evaluate your platform for their use cases, and just some of the overall user experience design that you've put into the platform to help ease that onboarding process.
0:35:58
Yeah, so a lot of the discussions in the sales process really revolve around that ELT philosophy: Fivetran is going to take care of replicating all of your data, and then you're going to curate it non-destructively using SQL. For some people that just seems like the obvious way to do it, but for others it's a very shocking proposition, this idea that your data warehouse is going to have this comparatively uncurated schema that Fivetran is delivering data into, and then you're basically going to make a second copy of everything. For a lot of people who've been doing this for a long time, that's a very surprising approach. So a lot of the discussion in sales revolves around the trade-offs of that and why we think that's the right answer for the data warehouses that exist today, which are just so much faster and so much cheaper that it makes sense to adopt that more human-friendly workflow than maybe it would have in the 90s.
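As a toy example of that non-destructive, SQL-based curation step (all table and column names invented): the raw, replicated tables stay untouched, and the curated "second copy" is defined on top of them, here via a view created through whatever Python driver your warehouse provides:

    # Assumes a DB-API style connection object from your warehouse's Python driver.
    CURATE_ORDERS = """
    CREATE OR REPLACE VIEW analytics.orders AS
    SELECT
        o.id           AS order_id,
        o.amount / 100 AS amount_usd,
        c.email        AS customer_email
    FROM raw_salesforce.opportunity AS o   -- raw replicated table, left as-is
    JOIN raw_salesforce.contact     AS c
      ON c.id = o.contact_id
    WHERE o.is_deleted = FALSE             -- filter in the view, not the source
    """

    def rebuild_curated_layer(connection):
        """Recreate the curated view without modifying the raw tables."""
        cur = connection.cursor()
        try:
            cur.execute(CURATE_ORDERS)
        finally:
            cur.close()

Because the curation lives entirely in SQL on top of the raw schema, it can be rerun or revised at any time without touching the replication itself.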
0:36:52
And what are the cases where Fivetran is the wrong choice for replicating data or integrating it into a data warehouse?
0:37:00
Well, if you already have a working system, you should keep using it. We don't advise people to change things just for the sake of change. If you've set up, you know, a bunch of Python scripts that are syncing all your data sources, and it's working, keep using it. What usually happens that causes people to throw out a system is schema changes, death by a thousand schema changes. They find that the data sources upstream are changing, the scripts that are syncing their data are constantly breaking, and it's this huge effort to keep them alive. That's the situation where prospects will abandon an existing system and adopt Fivetran. But what I'll tell people is, you know, if your schema is not changing, if you're not having to go fix these pipelines every week, don't change it, just keep using it.
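To illustrate why "death by a thousand schema changes" is what breaks hand-rolled pipelines: every sync has to reconcile whatever columns the source currently exposes against what the destination table already knows about. A minimal sketch of that reconciliation, with the column introspection assumed to happen elsewhere:

    def reconcile_schema(cursor, table, source_columns, destination_columns):
        """Add columns that appeared upstream so the sync keeps working.

        source_columns / destination_columns: dicts mapping column name to a
        SQL type, produced by whatever introspection your pipeline does.
        """
        for name, sql_type in source_columns.items():
            if name not in destination_columns:
                # A new upstream column: widen the destination instead of failing.
                cursor.execute(
                    f'ALTER TABLE {table} ADD COLUMN "{name}" {sql_type}'
                )

        dropped = set(destination_columns) - set(source_columns)
        if dropped:
            # Columns removed upstream are left in place so history is kept;
            # just surface them for review.
            print(f"{table}: columns no longer present upstream: {sorted(dropped)}")

Each source needs its own version of this logic, which is exactly the maintenance burden that accumulates when the upstream schemas keep moving.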
0:37:49
And as far as the overall challenges or complexities of the problem space that you're working with, I'm wondering what you have found to be some of the most difficult to overcome, or some of the ones that are most noteworthy and that you'd like to call out for anybody else who is either working in this space or considering building their own pipeline from scratch.
0:38:11
Yeah, you know, when we got our first customer in 2015, syncing Salesforce to Redshift, and two weeks later we got our second customer syncing Salesforce and HubSpot and Stripe into Redshift, I sort of imagined that we were going to have this sync problem solved in a year, and then we would go on and build a bunch of other related tools. And the sync problem is much harder than it looks at first. Getting all the little details right so that it just works is an astonishingly difficult problem. It is a parallelizable problem: you can have lots of developers working on different data sources, figuring out all those little details, and we have accumulated general lessons that we've incorporated into our core code. So we've gotten better at doing this over the years. And it really works when you have multiple customers on each data source, so it works a lot better as a product company than as someone building an in-house data pipeline. But the level of complexity associated with just doing replication correctly was kind of astonishing to me. And I think it is astonishing for a lot of people who try to solve this problem. You know, you look at the API docs of a data source, and you figure, oh, I think I know how I'm going to sync this. And then you go into production with 10 customers, and suddenly you find 10 different corner cases that you never thought of that are going to make it harder than you expected to sync the data. So the level of difficulty of just that problem is kind of astonishing. But the value of solving just that problem is also kind of astonishing.
0:39:45
On both the technical and business side, I'm also interested in understanding what you have found to be the most interesting or unexpected or useful lessons that you've learned in the overall process of building and growing Fivetran.
0:39:59
Well, I've talked about some of the technical lessons, in terms of, you know, just solving that problem really well being both really hard and really valuable. In terms of the business lessons we've learned, growing the company is like a co-equal problem to growing the technology. I've been really pleased with how we've made a place where people seem to genuinely like to work, where a lot of people have been able to develop their careers in different ways. Different people have different career goals, and you need to realize that as someone leading a company; not everyone at this company is like myself, they have different goals that they want to accomplish. So that problem of growing the company is just as important and just as complex as solving the technical problems, growing the product, and growing the sales side and helping people find out that you have this great product that they should probably be using. So I think that has been a real lesson for me over the last seven years that we've been doing this.

Now, for the future of Fivetran, what do you have planned, both on the business roadmap as well as the feature sets that you're looking to integrate into Fivetran, and just some of the overall goals that you have for the business as you look forward?
0:41:11
Sure. So
0:41:12
some of the most important stuff we're doing right now is on the sales and marketing side. We have done all of this work to solve this replication problem, which is very fundamental and very reusable, and I like to say no one else should have to deal with all of these APIs, since we have done it. You should not need to write a bunch of Python scripts to sync your data or configure Informatica or anything like that. We've done it once so that you don't have to, and I guarantee you it will cost you less to buy Fivetran than to have your own team building an in-house data pipeline. So we're doing a lot of work on the sales and marketing side just to get the word out that Fivetran is out there, and that it might be something that's really useful to you. On the product side, we are doing a lot of work now in helping people manage those transformations in the data warehouse. We have the first version of our transformations tool in our product, and there's going to be a lot more sophistication added to that over the next year. We really view that as the next frontier for Fivetran: helping people manage the data after we've replicated it.
0:42:17
Are there any other aspects of the Fivetran company and technical stack, or the overall problem space of data synchronization, that we didn't touch on that you'd like to cover before we close out the show?
0:42:28
I don't think so. I think the thing that people tend to not realize, because they tend to just not talk about it as much, is that the real difficulty in this space is all of that incidental complexity of all the data sources. You know, Kafka is not going to solve this problem for you. Spark is not going to solve this problem for you. There is no fancy technical solution. Most of the difficulty of the data centralization problem is just in understanding and working around all of the incidental complexity of all these data sources.
0:42:58
For anybody who wants to get in touch with you or follow along with the work that you and Fivetran are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
0:43:15
Yeah, I think that the biggest gap right now is in the tools that are available to analysts who are trying to curate the data after it arrives. Writing all the SQL that curates the data into a format that's ready for the business users to attack with BI tools is a huge amount of work, and it remains a huge amount of work. If you look at the workflow of the typical analyst, they're writing a ton of SQL, and it's a very analogous problem to a developer writing code in Java or C#, but the tools that analysts have to work with look like the tools developers had in, like, the 80s. I mean, they don't even really have autocomplete. So I think that is a really underinvested-in problem, just the tooling for analysts to make them more productive, in the exact same way as we've been building tooling for developers over the last 30 years. A lot of that needs to happen for analysts too, and I think it hasn't happened yet.
0:44:13
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Fivetran and some of the insights that you've gained in the process. It's definitely an interesting platform and an interesting problem space, and I can see that you're providing a lot of value. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
0:44:31
Thanks for having me on.

Simplifying Data Integration Through Eventual Connectivity - Episode 91

Summary

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
  • What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
  • In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
    • How do different implementations of graph databases impact their viability for this use case?
  • Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
  • How much up-front modeling is necessary to make this a viable approach to data integration?
  • How do the volume and format of the source data impact the technology and architecture decisions that you would make?
  • What are the limitations or edge cases that you have found when using this pattern?
  • In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
  • What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Workflow Engine For Data Engineers And Data Scientists - Episode 86

Summary

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Prefect is and your motivation for creating it?
  • What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
  • In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
  • How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
  • How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
  • What was your decision making process when deciding to use Dask as your supported execution engine?
    • For tasks that require specific resources or dependencies how do you approach the idea of task affinity?
  • Does Prefect support managing tasks that bridge network boundaries?
  • What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
  • What are the limitations of the open source core as compared to the cloud offering that you are building?
  • What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
  • What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
  • When is Prefect the wrong choice?
  • In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
    • What are some best practices and industry trends that you are most excited by?
  • What do you have planned for the future of the Prefect project and company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Lineage For Your Pipelines - Episode 82

Summary

Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pachyderm is and how it got started?
    • What is new in the last two years since I talked to Dan Whitenack in episode 1?
    • How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
  • A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
  • Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
    • How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
  • There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
  • Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
    • What is the interface for exposing and exploring that provenance data?
  • What are some of the advanced capabilities of Pachyderm that you would like to call out?
  • With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
  • What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
  • What do you have planned for the future of Pachyderm?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA