Open Source

Navigating Boundless Data Streams With The Swim Kernel - Episode 98

Summary

The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Swim.ai is and how the project and business got started?
    • Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
  • What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
  • How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
  • Can you describe a typical design for an application or system being built on top of the Swim platform?
    • What does the developer workflow look like?
      • What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
  • Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
  • For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
    • What mechanisms are in place to account for network failures?
  • Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
  • Since there is no explicit data layer, how is data redundancy handled by Swim applications?
  • What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
  • What have you found to be the most challenging aspects of building the Swim platform?
  • What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
  • What do you have planned for the future of the technical and business aspects of Swim.ai?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries, or are just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Simon Crosby about Swim.ai, the data fabric for the distributed enterprise. So Simon, can you start by introducing yourself?
Simon Crosby
0:02:28
Hi, I'm Simon Crosby. I am the CTO, I guess, of long duration. I've been around for a long time. And it's a privilege to be with the Swim folks, who have been building this fabulous platform for streaming data for about five years.
Tobias Macey
0:02:49
And do you remember how you first got involved in the area of data management?
Simon Crosby
0:02:53
Well, I have a PhD in applied mathematics and probability, so I am kind of not a data management guy. I'm an analysis guy. I like what comes out of, you know, streams of data and what inference you can draw from it. So my background is more on the analytical side. And then along the way, I began to see how to build big infrastructure for it.
Tobias Macey
0:03:22
And now you have taken up the position as CTO for Swim.ai, so I'm wondering if you can explain a bit about what the platform is and how the overall project and business got started?
Simon Crosby
0:03:33
Sure. So here's the problem. We're all reading all the time about these wonderful things that you can do with machine learning, and streaming data, and so on, and it all involves cloud and other magical things. And in general, most organizations just don't know how to make head or tail of that, for a bunch of reasons; it's just too hard to get there. So if you're an organization with assets that are spitting out lots of data, and that could be a bunch of different types, you know, you probably don't have the skill set in house to deal with a vast amount of information. And we're talking about boundless data sources here, things that never stop. And so to deal with these data flow pipelines, to deal with the data itself, to deal with the learning and inferences you might draw from that, and so on, enterprises face a huge skill set challenge. There is also a cost challenge, because today's techniques related to drawing inference from data in general result in, you know, large, expensive data lakes, either in house or perhaps in the cloud. And then finally, there's a challenge with the timeliness within which you can draw an insight. And most folks today believe that you store data, and then you think about it in some magical way, and you draw inference from that. And we're all suffering from the Hadoop and Cloudera, I guess, after effects. And really, this notion of storing and then analyzing needs to be dispensed with for fast data; certainly for boundless data sources that will never stop, it's really inappropriate. So when I talk about boundless data today, we're going to talk about data streams that just never stop, and we'll talk about the need to derive insights from that data on the fly, because if you don't, something will go wrong. So it's of the type that would stop your car before you hit the pedestrian in the crosswalk, that kind of stuff. So for that kind of data, there's just no chance to, you know, store it all down to a hard disk first.
Tobias Macey
0:06:16
And how would you differentiate the work that you're doing with the Swim.ai platform and the SwimOS kernel from things that are being done with tools such as Flink, or other streaming systems such as Kafka, which has now got capabilities for being able to do some limited streaming analysis on the data as it flows through, or also platforms such as Wallaroo that are built for being able to do stateful computations on data streams?
Simon Crosby
0:06:44
So first of all, there have been some major steps forward, and anything we do, we stand on the shoulders of giants. Let's start off with distinguishing between the large enterprise skill set that's out there, and the cloud world. And all the things you mentioned live in the cloud world. So with that reference distinction, most people in the enterprise, when you said Flink, wouldn't know what the hell you're talking about. Okay, similarly Wallaroo and anything else, they just wouldn't know what you're talking about. And so there is a major problem with the tools and technologies that are built for the cloud, really for cloud native applications, and the majority of enterprises, who are just stuck with legacy IT and application skill sets, are still coming up to speed with the right thing to do. And to be honest, they're still getting over the headache of Hadoop. So then, if we talk about the cloud native world, there is a fascinating distinction between all the various projects which have started to tackle streaming data. And there has been some major progress made there, and I'd be delighted to point out Swim being one of them, and to go into each one of those projects in detail as we go forward. The key point being that, first and foremost, the large majority of enterprises just don't know what to do.
Tobias Macey
0:08:22
And then within your specific offerings, there is the Data Fabric platform, which you're targeting for enterprise consumers, and then there's also the open source kernel of that in the form of SwimOS. I'm wondering if you can provide some explanation as to what are the differentiating factors between those two products, and the sort of decision points for when somebody might want to use one versus the other?
Simon Crosby
0:08:50
Yeah, let's cut it first at the distinction between the application layer and the infrastructure needed to run a large distributed data flow pipeline. And so for Swim, all of the application layer stuff, everything you need to build an app, is entirely open source. Some of the capabilities that you want to run a large distributed data pipeline are proprietary. And that's really just because, you know, we're building a business around this; we plan to open source more and more features over time.
Tobias Macey
0:09:29
And then as far as the primary use cases that you are enabling with the Swim platform, and some of the different ways that enterprise organizations are implementing it, what are some of the cases where using something other than Swim, either the OS or the Data Fabric layer, would be either impractical or intractable if they were trying to use more traditional approaches such as Hadoop, as you mentioned, or a data warehouse and more batch oriented workflows?
Simon Crosby
0:09:58
So let's start off describing what Swim does, can I do that? That might help. In our view, it's our job to build the pipeline, and indeed the model, from the data. Okay, so Swim just wants data, and from the data we will automatically build this typical data flow pipeline, and indeed, from that, we will build a model of arbitrarily interesting complexity, which allows us to solve some very interesting problems. Okay. So the Swim perspective starts with data, because that's where our customers' journey starts. They have lots and lots of data, and they don't know what to do with it. And so the approach we take in Swim is to allow the data to build the model. Now, you would naturally say that's impossible in general, but what it requires is some ontology at the edge, which describes the data. You could think of it as a schema, in fact, basically, to describe what data items mean, in some sort of useful sense to us as modelers. But then, given data, Swim will build that model. So let me give you an example. Given a relatively simple ontology for traffic and traffic equipment, so pedestrian lights, the loops in the road, the lights and so on, Swim will build a model which is a stateful digital twin, if you like, for every sensor, for every source of data, which is running concurrently in some distributed fabric, and processes its own raw data and continually evolves, okay. So simply given that ontology, Swim knows how to build stateful, concurrent little things we call web agents, actually, I'm, yeah, I'm using that term,
0:12:18
I guess the same as digital twin.
0:12:21
And these are concurrent things which are going to statefully process raw data and represent it in a meaningful way. And the cool thing about that is that each one of these little digital twins exists in a context, a real world context, that Swim is going to discover for us. So for example, an intersection might have 60 to 80 sensors. So there's this notion of containment, but also intersections are adjacent to other intersections in the real world map, and so on. That notion of adjacency is also a real world relationship. And in Swim, this notion of a link allows us to express the real world relationships between these little digital twins. And linking in Swim has this wonderful additional property: essentially, in Swim there is never a pub, but there is a sub, and if something links to something else, so if I link to you, then it's like LinkedIn for things, and I get to see the real time updates of the in memory state held by that digital twin. So digital twins link to digital twins courtesy of real world relationships, such as containment or proximity. We can even do other relationships, like correlation,
0:14:05
also linked to each other, which allows them to share data.
0:14:09
And sharing data allows interesting computational properties to be derived. For example, we can learn and predict. Okay, so job one is to define the songs ology something goes and builds a graph, and a graph of digital twins, which is constructed entirely from the data. And then the linking happens as part of that. And that allows us to then construct interesting competitions.
0:14:45
Is that useful?
Tobias Macey
0:14:46
Yes, that's definitely helpful to get an idea of some of the use cases, and some of the ways that the different concepts within Swim work together to be able to build out what a sort of conceptual architecture would be for an application that would utilize Swim.
Simon Crosby
0:15:03
So the key thing here is, I'm talking about an application. As I just said, the application is to predict the future, the future traffic in a city, or what's going to happen in the traffic arena right now. Now, I could do that for a bunch of different cities. What I can tell you is I need a model for each city. And there are two ways to build a model. One way is I get a data scientist and have them build one, or maybe they train it, and a whole bunch of other things, and I'm going to have to do this for every single city where I want to use this application. The other way to do it is to build the model from the data. And that's the approach. So what Swim does is, simply given the ontology, build these little digital twins, which are representatives of the real world things, get them to statefully evolve, and then link to other things, you know, to represent real world relationships. And then suddenly, hey presto, you have built a large graph, which is effectively the model that you would have had to have a human build otherwise, right? So it's constructed, in the sense that in any new city you go to, this thing, just given a stream of data, will build a model which represents the things that are the sources of data and their physical relationships. Does that make sense?
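To make the web agent idea above a bit more concrete, here is a minimal sketch of what a stateful digital twin for a single intersection might look like in Java, loosely following the patterns in the public SwimOS tutorials. The class name, lane names, and logic are illustrative assumptions rather than code from Swim's actual traffic application, so the exact API should be checked against the current SwimOS documentation.

```java
import swim.api.SwimLane;
import swim.api.agent.AbstractAgent;
import swim.api.lane.MapLane;
import swim.api.lane.ValueLane;

// Hypothetical digital twin ("web agent") for one traffic intersection.
// Each agent would be addressed by a URI such as /intersection/:id and
// evolves its in-memory state as raw sensor messages arrive.
public class IntersectionAgent extends AbstractAgent {

  // Latest semantic state of the intersection (e.g. current signal phase),
  // derived from raw sensor data rather than stored verbatim.
  @SwimLane("status")
  ValueLane<String> status = this.<String>valueLane()
      .didSet((newStatus, oldStatus) -> {
        // Anything linked to this lane sees the update streamed to it.
        System.out.println(nodeUri() + " status is now " + newStatus);
      });

  // Raw readings from the sensors contained in this intersection
  // (the "containment" relationship described above).
  @SwimLane("sensors")
  MapLane<String, Double> sensors = this.<String, Double>mapLane()
      .didUpdate((sensorId, newValue, oldValue) -> {
        // Re-derive the compact semantic state whenever a sensor reports.
        status.set(newValue > 0.5 ? "occupied" : "clear");
      });
}
```

Each such agent processes only its own raw data, holds its state in memory, and streams changes to whatever is linked to it.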
Tobias Macey
0:16:38
Yeah, and I'm wondering if you can expand upon that in terms of the type of workflow that a developer who is building an application on top of Swim would go through, as far as identifying what those ontologies are and defining how the links will occur as the data streams into the different nodes in the Swim graph.
Simon Crosby
0:17:01
So the key point here is that we think that what we can do is build, like, 80% of an app, okay, from the data. That is, we can find all of the big structural properties, all the structural properties of relevance in the data, and then let the application builder drop in what they want to compute. So let me try and express this slightly differently. Job one, we believe, is to build a model of these stateful digital twins, which almost mirror their real world counterparts. So at all points in time, their job is to represent the real world, as faithfully and as close to real time as they can, in a stateful way which is relevant to the problem at hand. Okay, so rather than volts, I'm going to have a red light, okay, something like that. And the first problem is to build these essential digital twins, which are interlinked, which represent the real world things, okay. And it's important to separate that from the application layer component of what you want to compute from it. So frequently we see people making the wrong decision, that is, hard coupling the notion of prediction, or learning, or any other form of analysis into the application in such a way that any change requires programming. And we think that that's wrong. So job one is to have this faithful representation of a real time world, in which everything evolves its own state whenever its real world twin evolves, and evolves statefully. And then the second component to that, which we do on a separate timescale, is to inject operators which are going to then compute on the states of those things at the edge, right. So we have a model which represents the relationships between things in the real world. It's attempting to evolve as close as possible to real time in relationship to its real world twin, and it's reflecting its links and so on. But the notion of what you want to compute from it is separate from that and decoupled. And so the second step, which is building an application right here, right now, is to drop in an operator which is going to compute a thing from that. So you might say, cool, I want every intersection to compute, you know, to be able to learn from its own behavior and predict. That's one thing. We might say, I want to compute the average wait time of every car in the city. That's another thing. So the key point here is that computing from these rapidly evolving worldviews is decoupled from the actual model of what's going on in that world at any point in time. So Swim reflects that decoupling by allowing you to bind operators to the model whenever you want.
0:20:45
Okay,
0:20:46
By whenever you want, I mean you can write them in code, in bits of Java or whatever, but also, you can write them in blobs of JavaScript or Python, and dynamically insert them into a running model. Okay, so let me make that one concrete for you. I could have a deployed system, which is a model, a deployed graph of digital twins, which are currently mirroring the state of Las Vegas. And then dynamically, a data scientist says, let me compute the average wait time of red cars at these intersections, and drops that in as a blob of JavaScript attached to every digital twin for an intersection. That is what I mean by an application. And so we want to get to this point where the notion of an application is not something deeply hidden in somebody's, you know, notebook, or Jupyter notebook, or in some programmer's brain, and they quit and wandered off to the next startup 10 months ago. An application is what anyone, you know, right now, can drop into a running model.
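Swim's own mechanism for this is the dynamic operator insertion Simon describes; the plain Java sketch below is not the SwimOS API, just an illustration of the decoupling, with hypothetical names: the twin evolves its own state, and an operator (here, a long-wait alert for red cars) can be bound to it or removed while it is running, without touching the twin's code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustration of decoupling: a stateful twin evolves on its own, and
// operators can be bound to (or unbound from) it while it is running.
interface Operator<S> {
  void onUpdate(S state);                        // called on every state change
}

final class IntersectionTwin {
  record State(String intersectionId, double redCarWaitSeconds) {}

  private final List<Operator<State>> operators = new CopyOnWriteArrayList<>();
  private volatile State state;

  IntersectionTwin(String id) {
    this.state = new State(id, 0.0);
  }

  // The twin evolves whenever its real world counterpart does.
  void observeRedCarWait(double seconds) {
    state = new State(state.intersectionId(), seconds);
    operators.forEach(op -> op.onUpdate(state));  // stream to bound operators
  }

  // Operators are bound "whenever you want", after the model is running.
  void bind(Operator<State> op)   { operators.add(op); }
  void unbind(Operator<State> op) { operators.remove(op); }
}

// An operator someone might drop in later, without modifying the twin:
// flag intersections where a red car has waited longer than two minutes.
final class LongWaitAlert implements Operator<IntersectionTwin.State> {
  @Override
  public void onUpdate(IntersectionTwin.State s) {
    if (s.redCarWaitSeconds() > 120.0) {
      System.out.println("Long wait at " + s.intersectionId()
          + ": " + s.redCarWaitSeconds() + "s");
    }
  }
}
```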
Tobias Macey
0:22:02
So the way that sounds to me is that, with Swim, you essentially deploy the infrastructure layer to ingest the data feeds from the sets of sensors, and then it will automatically create these digital twin objects to be able to have some digital manifestation of the real world, so that you have a continuous stream of data and how that's interrelated. And then it sort of flips the order of operations in terms of how the data engineer and the data scientist might work together. Where, in the way that most people are used to, you will ingest the data from these different sensors, bundle it up, and then hand it off to a data scientist to be able to do their analyses; they generate a model and then hand it back to the data engineer to say, okay, go ahead and deploy this and then see what the outputs are. Instead, the Swim platform essentially acts as the delivery mechanism and the interactive environment for the data scientist to be able to experiment with the data, build a model, and then get it deployed on top of the continuously updating live stream of data, and then be able to have some real world interaction with those sensors in real time as they're doing that, to be able to feed that back and say, okay, red cars are waiting 15% longer than other cars at these two intersections, and I want to be able to optimize our overall grid, and that will then feed back into the rest of the network to have some physical manifestation of the analysis that they're trying to perform, to try and maybe optimize all traffic.
Simon Crosby
0:23:39
So there are some consequences to that. First of all, every algorithm has to compute stuff on the fly. So if you look at, you know, the kind of store and then analyze approach to big data type learning, or training, or anything else, you know, here you don't have that. And so every algorithm that is part of Swim is coded in such a way as to continually process data. And that's fundamentally different to most frameworks. Okay, so for example,
0:24:19
the,
0:24:21
the learn and predict cycle is what, you know, you mentioned training, and so on, and that's very interesting. But that implies that I collect and store some training data, and that it's complete and useful enough to train the model and then hand it back. You know, what if it isn't? And so in Swim we don't do that. I mean, we can if you want; if you have a model, it's no problem for us to use that too. But instead, in Swim, the input vector, say to a prediction, say a deep neural network, is precisely the current state of the digital twins for some bunch of things, right? Maybe the set of sensors in the neighborhood of an urban intersection. And so this is a continuously varying, real world triggered scenario in which real data is fed through the algorithm but is not stored anywhere. So everything is fundamentally streaming. We assume that data streams continually, and indeed, the output of every algorithm streams continually. So what you see when you compute the average is the current average. Okay, when you're looking for heavy hitters, what you see is the current heavy hitters. All right. And so every algorithm has its streaming twin, I guess. And part of the art in the Swim context is reformulating the notion of analysis into a streaming context, so that you never expect a complete answer, because there isn't one; it's just what I've seen until now. Okay, and what I've seen until now has been fed through the algorithm, and this is the current answer. And so every algorithm computes and streams. And so the notion of linking, which I described earlier for Swim between digital twins, say, applies also to these operators, which effectively link to the things they want to compute from, and then they stream their results. Okay, so if you link in, you see a continual update. And, for example, that stream could be used to feed a Kafka implementation, which would serve a bunch of applications; you know, the notion of streaming is pretty well understood, so we can feed other bits of the infrastructure very well. But fundamentally, everything is designed to stream.
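As a small concrete example of what a "streaming twin" of a batch algorithm looks like: a running average can be kept with constant state and updated on every event, so the answer at any instant is simply the average of everything seen so far. This is a generic sketch, not code from SwimOS.

```java
// Streaming counterpart of a batch average: constant state, updated per event.
// current() is always "the average of everything seen until now".
final class StreamingMean {
  private long count;
  private double mean;

  void observe(double x) {
    count += 1;
    mean += (x - mean) / count;   // incremental mean; no samples are stored
  }

  double current() {
    return mean;
  }
}
```

The same pattern carries over to heavy hitters, variances, and learned models: each operator keeps a small summary in memory and streams its current answer to whatever links to it.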
Tobias Macey
0:27:21
It's definitely an interesting approach to the overall workflow of how these analyses work. And one thing that I'm curious about is how data scientists and analysts have found working with this platform, compared to the ways that they might be used to working. You know, I'm interested in what data scientists think of it, or how they view this.
Simon Crosby
0:27:45
to be honest, in general with surprise.
0:27:50
Our experience to date has been largely with people who don't know what the heck they're doing in terms of data science. So they're trying to run an oil rig more efficiently; they have, what, about 10,000 sensors, and they want to make sure this thing isn't going to blow up. Okay? So they tend to be heavily operationally focused folks. They're not data scientists, they never could afford one. And they don't understand the language of data science, or have the ability to build cloud based pipelines that you and I might be familiar with. So these are folks who effectively just want to do a better job, given this enormous stream of data they have. They believe they have something in the data, they don't know what that might be, but they're keen to go and see. Okay. And so those are the folks who we spend most of our time with. I'll give you a funny example, if you'd like.
Tobias Macey
0:29:00
Sure, if it's illustrative.
Simon Crosby
0:29:02
we work with a manufacturer of aircraft.
0:29:05
And they have a very large number of RFID tagged parts, and equipment too. And so if you know anything about RFID, you know it's pretty useless stuff, built from about 10 or 20 years ago. And so what they were doing is, from about 2,000 readers, getting about 10,000 reads a second, and each one of these reads is simply being written into an Oracle database, and at the end of the day they try and reconcile the whole lot with whatever parts they have, and wherever the thing is, and so on. And the Swim solution to this is entirely different, and it gives you a good idea of why we care about modeling data, or thinking about data, differently. We simply build a digital twin for every tag; the first time it's seen, we create one, and then if they haven't been seen for a long time, they just expire. And whenever a reader sees a tag, it simply says, hey, I saw you, and this was the signal strength. Now, because tags get seen by multiple readers, each digital twin of a tag does the obvious thing: it triangulates from the readers. Okay, so it learns the attenuation in different parts of the plant. It's very simple, really; the word learn there is rather a stretch, it's a pretty straightforward calculation, and then suddenly it can work out where it is in 3-space. So instead of an Oracle database, or a database full of tag reads, and lots and lots of post processing, you know, you have a couple of Raspberry Pis, and each one of these Raspberry Pis, you know, has millions of these tags running, and then you can ask any one of them where it is. Okay, and then you can do even more, you can say, hey, show me all the things within three meters of this tag, okay? And that allows you to see components being put together into real physical objects, right? So as a fuselage gets built up, or the engine or whatever it is. And so a problem which was tons of infrastructure and tons of tag reads got turned into a couple of Raspberry Pis running stuff which kind of self organized into a form which could feed real time visualization and control around where the bits of infrastructure were.
0:31:52
Okay. Now, that
0:31:54
was transformative for this outfit, which quite literally had no other way of tackling the problem. Does that make sense?
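The "straightforward calculation" mentioned above can be as simple as a signal-strength-weighted centroid: each tag's twin keeps the latest strength at which each reader (whose position is known) saw it, and estimates its own position from those sightings. This is a hypothetical sketch of that idea, not the manufacturer's or Swim's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical position estimate for one RFID tag's digital twin:
// a signal-strength-weighted centroid of the readers that last saw it.
final class TagTwin {
  record Point(double x, double y, double z) {}
  record Sighting(Point readerPosition, double signalStrength) {}

  // Latest sighting per reader; stale entries would be expired elsewhere.
  private final Map<String, Sighting> sightings = new ConcurrentHashMap<>();

  void onRead(String readerId, Point readerPosition, double signalStrength) {
    sightings.put(readerId, new Sighting(readerPosition, signalStrength));
  }

  // "Where am I?" answered on demand from in-memory state.
  Point estimatePosition() {
    double wSum = 0, x = 0, y = 0, z = 0;
    for (Sighting s : sightings.values()) {
      double w = s.signalStrength();        // stronger signal => closer reader
      wSum += w;
      x += w * s.readerPosition().x();
      y += w * s.readerPosition().y();
      z += w * s.readerPosition().z();
    }
    return wSum == 0 ? null : new Point(x / wSum, y / wSum, z / wSum);
  }
}
```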
Tobias Macey
0:32:02
Yeah, that's definitely a very useful example of how this technology can flip the overall order of operations and just the overall capabilities of an organization to be able to answer useful questions. And the idea of going from, as you said, an Oracle database full of just rows and rows of records of this tag read at this point in time in this location, and then being able to actually get something meaningful out of it, as far as this part is in this location in the overall reference space of the warehouse, is definitely transformative, and probably gave them weeks or months worth of additional lead time for being able to predict problems or identify areas for potential optimization.
Simon Crosby
0:32:47
Yeah, I think we saved them $2 million a year. Let me tell you, from this tale come two interesting things. First of all, if you show up at a customer site running on a Raspberry Pi, you can't charge them a million bucks. Okay, that's lesson one. Lesson two is that the volume of the data is not relevant, or not related, to the value of the insight. Okay. I mentioned traffic earlier. In the city of Las Vegas, we get about 15 or 16 terabytes per day from the traffic infrastructure, and every intersection, every digital twin of an intersection in the city, predicts two minutes into the future, okay. And those insights are sold via an API in Azure to customers like Audi and Uber and Lyft and whatever else, okay. Now, that's a ton of data, okay; you couldn't even think of where to put it in your cloud. But the monetary value of the insight is relatively low. That is, the total amount of money I can extract from Uber per month per intersection is low. All right, and by the way, all this stuff is open source, you can go grab it and play and hopefully make your city better. So, you know, it's not a high enough value for me to do anything other than say, go grab it and run. So: vast amounts of data, and relatively important, but not commercially significant, value.
Tobias Macey
0:34:35
And another aspect of that case in particular is that, despite this volume of data, it might be interesting for being able to do historical analyses, but in terms of the actual real world utility, it has a distinct expiration period, where you have no real interest in the sensor data as it existed an hour ago, because that has no particular relevance to your current state of the world and what you're trying to do with it at this point in time.
Simon Crosby
0:35:03
Yeah, you have historical interest in the sense of wanting to know if your predictions were right, or wanting to know for traffic engineering purposes, which run on a slower time scale. So some form of bucketing, or whatever, some sort of coarser recording, is useful. And sure, that's easy. But you certainly do not want to record it at the raw data rate.
Tobias Macey
0:35:30
And then going back to the other question I had earlier, when we were talking about the workflow of an analyst or a data scientist pushing out their analyses live to these digital twins and potentially having some real world impact, I'm curious if the Swim platform has some concept of a dry run mode, where you can deploy this analysis and see what the output of it is, and see maybe what impact it would have, without it actually manifesting in the real world, for cases where you want to ensure that you're not accidentally introducing error or potentially having a dangerous outcome, particularly in the case that you were mentioning of an oil and gas rig.
Simon Crosby
0:36:12
Yeah, so to be clear, everything we've done thus far has been open loop, in the sense that we're informing another human or another application, but we're not directly controlling the infrastructure. And the value of a dry run would be enormous, you can imagine, in those scenarios, but thus far we don't have any use cases that we can report of using Swim for direct control. We do have use cases where, on a second by second basis, we are predicting whether machines are going to make an error as they build PCBs for servers, and so on. But then again, what you're doing is you're calling for maintenance to come over and fix the machine; you're not, you know, you're not trying to change the way the machine behaves.
Tobias Macey
0:37:06
And now digging a bit deeper into the actual implementation of Swim, I'm wondering if you can talk through how the actual system itself is architected, and some of the ways that it has evolved as you have worked with different partners to deploy it into real world environments and get feedback from them, and how that has affected the overall direction of the product roadmap.
Simon Crosby
0:37:29
So Swim is a couple of megabytes of Java, essentially. Okay? So it's extremely lean; we tend to deploy in containers using the GraalVM. It's very small; we can run in, you know, probably 100 megabytes or so. And so when people tend to think of edge, they tend to think of running in gateways or things; we don't really think of edge that way. And so an important part of defining edge, as far as we're concerned, is simply gaining access to streaming data. We don't really care where it is, but we need to be small enough to get onto limited amounts of compute towards the physical edge. And, you know, the product has evolved in the sense that originally it was a way of building applications for the edge, and you'd sit down and write them in Java, and so on.
0:38:34
Latterly, this ability to simply let
0:38:39
the data build the app, or let the data build most of the app, has come in response
0:38:46
to customer needs.
0:38:49
But Swim is deployed typically in containers, and for that we have, in the current release, relied very heavily on the Azure IoT Edge framework. And that is magical, to be quite honest, because we can rely on Microsoft machinery to deal with all of the painful bits of deployment and lifecycle management for the code base and the application as it runs. These are not things we are really focused on; what we're trying to do is build a capability which will respond to data and do the right thing for the application developer. And so we are fully published in the Azure IoT Hub, and you can download this and get going and manage it through its lifecycle that way. And so in several use cases now, what we're doing is we are used to feed fast timescale insights at the physical edge, we are labeling data and then dropping it into Azure ADLS Gen2, and feeding insights into applications built in Power BI. Okay, so that's just for the sake of machinery, you know, using the Azure framework for management of the IoT edge. By the way, I think IoT Edge is, too bad, the worst possible name you could ever pick, because all you want is a thing to manage the lifecycle of a capability which is going to deal with fast data; whether it's at the physical edge or not is immaterial. But that's basically what we've been doing: relying on Microsoft's fabulous lifecycle management framework for that, plugged into the IoT Hub, and all the Azure IoT, well, Azure services generally, for back end things which enterprises love.
Tobias Macey
0:41:00
Then another element of what we're discussing, in the use case examples that you were describing, particularly for instance with the traffic intersections, is the idea of discoverability and routing between these digital twins, as far as how they identify the cardinality of which twins are useful to communicate with and establishing those links, and also, at the networking layer, how they handle network failures in terms of communication and ensuring that if there is some sort of fault they're able to recover from it.
Simon Crosby
0:41:36
So let's talk about two layers. One is the app layer, and the other one is the infrastructure, which is going to run this effectively distributed graph.
0:41:45
And so Swim is going to build this graph for us
0:41:49
from the data. What that means is the digital twins, by the way, we technically call these web agents, these little web agents are going to be distributed somewhere over a fabric of physical instances, and they may be widely geographically
0:42:06
distributed. And
0:42:08
so there is a need, nonetheless, at the application layer, for things which are related in some way, linked physically or, you know, in some other way, to be able to link to each other; that is, in effect,
0:42:23
to have a sub. And so links
0:42:27
require that objects, which are the digital twins, have the ability to inspect
0:42:33
each other's data,
0:42:34
right, their members. And of course, if something is running on the other side of the planet, and you're linked to it, how on earth is that going to work? So, we're all familiar with object oriented languages and objects in one address space; that's pretty easy. We know what an
0:42:50
object handle or an object
0:42:51
reference or a pointer or whatever is, we get it. But when these things are distributed, that's hard. And so in Swim, if you're an application programmer, you will simply use object references, but these resolve to URIs. So in practice, at runtime, when I link to you, I link to your URI. And that link,
0:43:17
well, it's resolved by Swim, and it
0:43:19
enables a continuous stream of updates to flow from you to me. And if we happen to be on different instances, that is, running in different address spaces, then that will go over a mesh of, you know, direct WebSocket connections between your instance and mine. And so in any Swim deployment, all instances are interlinked; they each link to each other using a single WebSocket connection, and then these links permit the flow of information between linked digital twins. And what happens is, whenever a change in the in-memory state of a linked, you know, digital twin happens, its instance then streams to every other linked object an update to the state for that thing, right. So what's required is, in effect, a streaming update to JSON: because we're going to record our model in some form of, like, JSON state or whatever, we need to be able to update little bits of it as things change, and we use a protocol called WARP for that. And that's a Swim capability which we've open sourced. And what that really does is bring streaming to JSON, right, streaming updates to parts of a JSON model. And then every instance in Swim maintains its own view of the whole model. So as things stream in, the local view of the model changes. But the view of the world is very much one of a consistency model based on whatever happens to be executing locally and whatever it needs to view; it's an eventually consistent data model, in which every node eventually learns the entire thing.
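For a sense of what a link looks like from the programmer's side: in the open source SwimOS client API, a downlink to a remote web agent's lane opens a WARP subscription over a WebSocket and streams every subsequent state change to a callback. The sketch below follows the shape of the public SwimOS client examples, but the host, node, and lane URIs are made up, and the exact method names should be checked against the current documentation.

```java
import swim.client.ClientRuntime;

// Hedged sketch: open a link (downlink) to a remote web agent's lane and
// receive its state changes as a continuous stream.
public class LinkExample {
  public static void main(String[] args) {
    final ClientRuntime swimClient = new ClientRuntime();
    swimClient.start();

    swimClient.downlinkValue()
        .hostUri("warp://example.com:9001")   // hypothetical Swim instance
        .nodeUri("/intersection/1234")        // hypothetical web agent URI
        .laneUri("status")                    // lane on that agent
        .didSet((newValue, oldValue) ->
            System.out.println("status is now " + newValue))
        .open();

    // The link stays open and updates keep streaming until it is closed.
  }
}
```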
Tobias Macey
0:45:22
And generally, eventually here means, you know, a link's delay away from real time, right? So, a link's delay away from real time. And then the other aspect of the platform is the statefulness of the computation, and as you're saying, that state is eventually consistent, dependent on the communication delay between the different nodes within the context graph. And then in terms of data durability, one thing I'm curious about is the length of state, or sort of the overall buffer, that is available. I'm guessing that is largely dependent on where it happens to be deployed and what the physical capabilities are of the particular node. And then also, as far as persisting that data for maybe historical analysis, my guess is that that relies on distributing the data to some other system for long term storage. I'm just wondering what the overall sort of pattern or paradigm is for people who want to be able to have that capability?
Simon Crosby
0:46:24
Oh, this is a great question. So in general, the move is going from some horrific raw data form on the wire, from the original physical thing, to, you know, something much more efficient and meaningful in memory, and generally much more concise, so we get a whole ton of data reduction on the way. And the system is focused on streaming; we don't stop you storing your original data if you want to, you might just have to store it somewhere else or whatever. The key thing in Swim is we don't do that on the hot path. Okay, so things change their state in memory, and maybe compute on that, and that's what they do first and foremost, and then we lazily throw things to disk, because disk happens slowly relative to compute. And so typically, what we end up storing is the semantic state of the context graph, as you put it, not the original data.
0:47:23
That is, for example, in traffic world,
0:47:26
you know, we store things like, this light turned red at this particular time, not the voltage on all the registers in the light, and so we get massive data reduction. And that form of data is very amenable to storage in the cloud, say, or somewhere else, and it's even affordable at reasonable rates.
0:47:50
So the key thing for Swim and storage is,
0:47:53
you're going to remember as much as you want, as much as you have space for, locally. And then storage in general is not on the hot path; it's not on the compute and streaming path. And in general, we're getting huge data reductions for every step up the graph we make. So for example, if I go from, you know, all the states of all the traffic sensors to predictions, then I've made a very substantial reduction in the data I need to retain anyway, right. So as you move up this computational graph, you reduce the amount of data you're going to have to store, and it's up to you really to pick what you want to keep. So
Tobias Macey
0:48:39
in terms of your overall experience, working as the CTO of this organization and shepherding the product direction and the capabilities of this system, I'm wondering what you have found to be some of the most challenging aspects, both from the technical and business sides, and some of the most useful or interesting or unexpected lessons that you've learned in the process.
Simon Crosby
0:49:03
So what's hard is that the real world is not the cloud native world. So we've all seen fabulous examples of Netflix and Amazon and everybody else doing cool things with the data they have. But you know, if you're an oil company, and you have a rig at sea, you just don't know how to do this. So, you know, we can come at this with whatever skill sets we have; what we find is that the real world large enterprises of today are still ages behind the cloud native folk. And that's a challenge. Okay, so getting to be able to understand what they need, because they still have lots of assets which are generating tons of data, is very hard. Second, this notion of edge is continually confusing. And I mentioned previously that I would never have chosen IoT Edge, for example, as the Azure name, because it's not about IoT, or maybe it is, but let me give you two examples. One is traffic lights, say, physical things; it's pretty obvious what the notion of edge is there, it's the physical edge. But the other one is this: we build a real time model for millions, tens of millions, of handsets for a large mobile carrier, in memory, and it evolves all the time, right, in response to continually received signals from these devices,
0:50:38
there is no edge,
0:50:40
that is, the data just arrives over the internet, and we have to figure out where the digital twin for that thing is, and evolve it in real time. Okay, and there, you know, there is no concept of a network, you know, or a physical edge and traveling over them. We just have to make decisions on the fly and learn and update the model.
0:51:06
So for me, edge is the following thing: edge is stateful.
0:51:13
And
0:51:15
cloud is all about REST. Okay, so what I'd say is, the fundamental difference between the notion of edge and the notion of cloud that I would like to see broadly understood is that whereas REST and databases made the cloud very successful, in order to be successful with, you know, this boundless streaming data, statefulness is fundamental, which means REST goes out the door, and we have to move to a model which is streaming based, with stateful computation.
Tobias Macey
0:51:50
And then in terms of the future direction, both from the technical and business perspective, I'm wondering what you have planned for both the enterprise product for Swim.ai, as well as the open source kernel in the form of SwimOS.
Simon Crosby
0:52:06
From an open source perspective, we,
0:52:08
you know, we don't have the advantage of having come up at LinkedIn or something, where we built it in house at scale and it came out of that; we're coming out of a startup. But we think what we've built is something which is of phenomenal value, and we're seeing that grow. And our intention is to continually feed the community as much as it can take, and we're just getting more and more stuff ready for open sourcing.
0:52:36
So we want to see our community
0:52:40
go and explore new use cases for using this stuff, and we are totally dedicated to empowering our community. From a commercial perspective, we are focused on our world, which is edge, and the moment you say that, people tend to get an idea of a physical edge or something in their heads, and then, you know, very quickly you can get put in a bucket of IoT. I gave an example of, say, building a model in real time in AWS for, you know, a mobile carrier. Our intention is to continue to push the bounds of what edge means, and to enable people to build streaming pipelines for massive amounts of data easily, without complexity, and without the skill set required to invest in these traditionally fairly heavyweight pipeline components such as Beam and Flink and so on,
0:53:46
to
0:53:47
to enable people to get insights cheaply, and to make the problem of dealing
0:53:51
with new insights from data very easy to solve.
Tobias Macey
0:53:56
And are there any other aspects of your work on Swim, or the space of streaming data and digital twins, that we didn't discuss yet that you'd like to cover before we close out the show?
Simon Crosby
0:54:08
I think we've done a pretty good job. You know, I think there are a bunch of parallel efforts, and that's all goodness. One of the hardest things has been to get this notion of statefulness more broadly accepted. And I see the function as a service vendors out there pushing their idea of stateful functions as a service, and really, these are stateful actors. And there are others out there too. So for me, step number one is to get people to realize that if we're going to take on this data, then REST and databases are going to kill us, okay? That is, there is so much data, and the rates are so high, that you simply cannot afford to use a stateless paradigm for processing; you have to do it statefully. Because, you know, forgetting context every time and then looking it up is just too expensive.
Tobias Macey
0:55:08
For anybody who wants to follow along with you, get in touch, and keep track of what you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Simon Crosby
0:55:26
Well, I think, I mean, there isn't much tooling, to be frank. There are a bunch of really fabulous open source code bases, and experts in their use, but that's far from tooling. And then there is, I guess, an extension of Power BI downwards, which is something like the monster Excel spreadsheet world, right? So you find all these folks who are pushing that kind of, you know, end user model of data, doing great things, but leaving a huge gap between the consumer of the insight and the data itself; it assumes the data is already there in some good form and can be put into a spreadsheet or view or whatever it happens to be. So there's this huge gap in the middle, which is: how do we build the model? What does the model tell us, just off the bat? How do we do this constructively in large numbers of situations? And then how do we dynamically insert operators which are going to compute useful things for us on the fly into running models?
Tobias Macey
0:56:44
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on the Swim platform. It's definitely a very interesting approach to data management and analytics, and I look forward to seeing the direction that you take it in the future. So I appreciate your time on that, and I hope you enjoy the rest of your day.
Simon Crosby
0:57:01
Thanks very much. You've been great.
Tobias Macey
0:57:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building A Reliable And Performant Router For Observability Data - Episode 97

Summary

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what the Vector project is and your reason for creating it?
    • What are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?
  • What strategy are you using for project governance and sustainability?
  • What are the main use cases that Vector enables?
  • Can you explain how Vector is implemented and how the system design has evolved since you began working on it?
    • How did your experience building the business and products for Timber influence and inform your work on Vector?
    • When you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?
    • What led you to choose Lua as the embedded scripting environment?
  • What data format does Vector use internally?
    • Is there any support for defining and enforcing schemas?
      • In the event of a malformed message is there any capacity for a dead letter queue?
  • What are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?
  • When designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?
  • What options are available to operators to support visibility into the running system?
  • In terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?
  • What are some of the other considerations that operators and administrators of Vector should be considering?
  • You have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?
  • What is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)
  • What are some of the most interesting/innovative/unexpected ways that you have seen Vector used?
  • What are some of the challenges that you have faced in building/publicizing Vector?
  • For someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?
    • What is missing that you would consider necessary for production readiness?
  • When is Vector the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open source observability data router. So Ben, can you start by introducing yourself?
Ben Johnson
0:01:47
Sure. My name is Ben. I am the co-founder and CTO of Timber.io.
Tobias Macey
0:01:53
And Luke, How about yourself?
Luke Steensen
0:01:54
Yeah. I'm Luke Steensen. I'm an engineer at Timber.
Tobias Macey
0:01:58
And Ben, going back to you. Do you remember how you first got involved in the area of data management?
Ben Johnson
0:02:02
Yeah. So I mean, just being an engineer, obviously, you get involved in it through observing your systems. And so before we started Timber, I was an engineer at SeatGeek, and we dealt with all kinds of observability challenges there.
Tobias Macey
0:02:16
And Luke, do you remember how you first got involved in the area of data management?
Luke Steensen
0:02:20
Yeah, so at my last job, I ended up working with Kafka quite a bit in a few different contexts. I ended up getting pretty involved with that project, leading some of our internal stream processing projects that we were trying to get off the ground, and I just found it a very interesting space. The more that you dig into a lot of different engineering problems, the more it ends up boiling down to managing your data, especially when you have a lot of it. It becomes the primary challenge and limits a lot of what you're able to do. So the more tools and techniques you have to address those issues and design around them, the further you can get, I think.
Tobias Macey
0:03:09
And so you both work at Timber.io, and you have begun work on this Vector project. So I'm wondering if you can explain a bit about what it is and the overall reason that you had for creating it in the first place?
Ben Johnson
0:03:21
Yeah, sure. So in the most basic sense, Vector is an observability data router. It collects data from anywhere in your infrastructure, whether that be a log file, a TCP socket, or StatsD metrics, and Vector is designed to ingest that data and then send it to multiple destinations. The idea is that it is vendor agnostic: it collects data from many sources and sends it to many things. And the reason we created it was really for a number of reasons. One, being an observability company — when we initially launched Timber, it was a hosted logging platform, and we needed a way to collect our customers' data. We tried writing our own initially, in Go, that was very specific to our platform, and that was very difficult. We started using off-the-shelf solutions and found those to also be difficult; we were getting a lot of support requests, and it was hard for us to contribute and debug them. And then, in general, our ethos as a company is that we want to create a world where developers have choice, aren't locked into specific technologies, and are able to move with the times and choose best-in-class tools for the job. That's what prompted us to start Vector. That vision, I think, is enabled by an open collector that is vendor agnostic and meets a quality standard that makes people want to use it. It looks like we have other areas in this podcast where we'll get into some of the details there, but we really wanted to raise the bar on the open collectors and start to give control and ownership back to the people, the developers, that were deploying Vector.
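For illustration, a minimal Vector configuration along the lines Ben describes might look like the sketch below. The component names and options (`file`, `tcp`, `elasticsearch`, `include`, `address`, `host`) are assumptions drawn from Vector's documented TOML format rather than from the episode, and may differ between versions.

```toml
# Sketch: collect from a log file and a TCP socket, ship both to one sink.
[sources.app_logs]
type    = "file"
include = ["/var/log/myapp/*.log"]   # tail application log files

[sources.net_logs]
type    = "tcp"                      # accept newline-delimited events over TCP
address = "0.0.0.0:9000"

[sinks.search]
type   = "elasticsearch"
inputs = ["app_logs", "net_logs"]    # fan-in from both sources
host   = "http://localhost:9200"
```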
Tobias Macey
0:05:14
And as you mentioned, there are a number of other off-the-shelf solutions that are available. Personally, I've had a lot of experience with Fluentd, and I know that there are other systems coming from the Elastic stack and other areas. I'm curious, what are some of the tools that you consider to be comparable or operating in the same space, and of the ones that you've had experience with, which did you find to be lacking? And what were the particular areas that you felt needed to be addressed that weren't being handled by those other solutions?
Ben Johnson
0:05:45
Yeah, I think that's a really good question. So first, I would probably classify the collectors as either open or not. Typically we're not too concerned with vendor-specific collectors, like the Splunk forwarder or any other sort of thing that just ships data to one service. So in the category of comparing tools, I'll focus on the ones that are open, like you said — Fluentd, Filebeat, Logstash — although I think it's questionable that they're completely open. But I think we're more comparable to those tools. And then I'd also say that we typically try to stay away from — I don't want to say anything negative about those projects, because a lot of them were pieces of inspiration for us, and we respect the fact that they are open and they were solving a problem at the time. But one of the projects that we thought was a great alternative and that inspired us is one called Cernan. It was built by Postmates, and it's also written in Rust. That opened our eyes a little bit to a new bar, a new standard that you could set with these collectors. We actually know Brian Trautwein, one of the developers that worked on it; he's been really friendly and helpful to us. But the reason we didn't use Cernan is, one, it was created out of necessity at Postmates, and it doesn't seem to be actively maintained, and that's one of the big reasons we started Vector. So I would say that's something that's lacking: a project that is actively maintained and is in it for the long term. Obviously that's important. And then in terms of the actual specifics of these projects, there's a lot that I could say here. On one hand, we've seen a trend of certain tools that are built for a very specific storage and then sort of backed into supporting more sinks, and it seems like the incentives and fundamental practices of those tools are not aligned with many disparate storages that ingest data differently — for example, the fundamentals of batching versus stream processing. Those two ways of collecting data and sending it downstream don't work for every single storage that you want to support. The other thing is just the obvious ones like performance, reliability, and having no dependencies: if you're not a strong Java shop, you probably aren't comfortable deploying something like Logstash and then managing the JVM and everything associated with that. And I think another thing is we wanted a collector that was fully vendor agnostic and vendor neutral, and a lot of them don't necessarily fall into that bucket. As I said before, that's something we really strongly believe in: an observability world where developers can rely on a best-in-class tool that is not biased and has aligned incentives with the people using it, because there are just so many benefits that stem from that.
Tobias Macey
0:08:51
And on the point of sustainability, and openness, I'm curious, since you are part of a company, and this is and some ways related to the product offering that you have how you're approaching issues, such as product governance and sustainability and ensuring that the overall direction of the project is remaining impartial and open and trying to foster a community around it so that it's not entirely reliant on the direction that you try to take it internally and that you're incorporating input from other people who are trying to use it for their specific use cases.
Ben Johnson
0:09:28
Yeah, I think that's a great question.
0:09:31
So one is we want to be totally transparent on everything. Everything we do with Vector — discussions, specifications, roadmap planning — is all available on GitHub, so nothing is private there, and we want Vector to truly be an open project that anyone can contribute to. And then, in terms of governance and sustainability, we try to do a really good job just maintaining the project. Number one is good issue management: making sure that's done properly helps people search for issues, helps them understand which issues need help, and what are good first issues to start contributing on. We wrote an entire contribution guide and put some serious thought into it so that people understand what the principles of Vector are and how to get started. And then I think the other thing that really sets Vector apart is the documentation, and I think that's actually very important for sustainability. It's really a reflection of your project's respect for the users, in a way, but it also serves as a really good opportunity to explain the project and help people understand its internals and how to contribute to it. So it all comes together, but I'd say the number one thing is transparency, and making sure everything we do is out in the open.
Tobias Macey
0:11:00
And then in terms of the use cases, that vector enables, obviously, one of them is just being able to process logs from a source to a destination. But in the broader sense, what are some of the ways that vector is being used both at timber and with other users and organizations that have started experimenting with it beyond just the simple case.
Ben Johnson
0:11:20
So first, Vector's new, and we're still learning a lot as we go. But the core business use cases we see include everything from reducing cost — Vector is quite a bit more efficient than most collectors out there, so just by deploying Vector you're going to be using fewer CPU cycles and less memory, and you'll have more of that available for the app that's running on that server. The flip side of that is that Vector enables choosing multiple storages — the storage that is best for your use case — which lets you reduce costs as well. For example, if you're running an ELK stack, you don't necessarily want to use it for archiving; you can use another cheaper, durable storage for that purpose and take that responsibility out of your ELK stack, and that reduces costs in that way. So I think that's an interesting way to use Vector. Another one is, like I said before, reducing lock-in. That use case is so powerful because it gives you agility, choice, control, and ownership of your data. Transitioning vendors is a big use case: so many companies we talk to are bogged down and locked in to vendors, and they're tired of paying the bill, but they don't see a clean way out. And observability is an interesting problem because it's not just technology coupling — there are human workflows that are coupled with the systems you're using. Transitioning out of something that maybe isn't working for your organization anymore requires a bridge, and Vector is a really great way to do that: deploy Vector, continue sending to whatever vendor you're using, and at the same time start to try out other storages and other setups without disrupting the human workflows in your organization. And I can keep going — there's data governance, where we've seen people cleaning up their data and enforcing schemas, and security and compliance, where you have the ability to scrub sensitive data at the source before it even goes downstream. So again, having a good open tool like this is incredibly powerful because of all of the use cases that it enables, and it lets you take advantage of them when you're ready.
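As a sketch of the archiving use case Ben mentions, the same stream can feed both a search cluster and cheaper object storage. The sink names and options here (`elasticsearch`, `aws_s3`, `bucket`, `region`) are assumptions based on Vector's documentation, not details from the episode.

```toml
[sources.app_logs]
type    = "file"
include = ["/var/log/myapp/*.log"]

[sinks.search]
type   = "elasticsearch"             # hot storage for querying recent data
inputs = ["app_logs"]
host   = "http://localhost:9200"

[sinks.archive]
type   = "aws_s3"                    # cheap, durable archive instead of the ELK cluster
inputs = ["app_logs"]
bucket = "my-log-archive"
region = "us-east-1"
```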
Tobias Macey
0:13:36
In terms of the actual implementation of the project, you've already mentioned in passing that it was written in Rust, and I'm wondering if you can dig into the overall system architecture and implementation of the project, and some of the ways that it has evolved since you first began working on it.
Luke Steensen
0:13:53
Yeah, like you said, Rust is — I mean, that's kind of the first thing everybody looks at with a certain interest.
0:13:57
And on top of that, we're building with the Tokio asynchronous I/O stack of libraries and tools within the Rust ecosystem. From the beginning, we've kept Vector pretty simple, architecturally. We have an eye on where we'd like to be, but we're trying to get there very incrementally. At a high level, each of the internal components of Vector is generally either a source, a transform, or a sink — probably familiar terms if you've dealt with this type of tool before. Sources are something that helps you ingest data; transforms are the different things you can do to it, like parsing JSON data into our internal data format, doing regular expression value extraction, or, like Ben mentioned, enforcing schemas — all kinds of stuff like that. And then sinks, obviously, are where we actually forward that data downstream to some external storage system or service. So those are the high-level pieces. We have some different patterns around each of those, and obviously there are different flavors: if you have a UDP syslog source, that's going to look and operate a lot differently than a file-tailing source. So there are a lot of different styles of implementation, but we fit them all into those three buckets of source, transform, and sink. And then in the way that you configure Vector, you're basically building a dataflow graph, where data comes in through a source, flows through any number of transforms, and then down the graph into a sink or multiple sinks. We try to keep it as flexible as possible, so you can pretty much build an arbitrary graph of data flow. Obviously there are going to be situations where you could build something that's pathological or won't perform well, but we've leaned towards giving users the flexibility to do what they want. So if you want to parse something as JSON, then use a regex to extract something out of one of those fields, then enforce a schema and drop some fields, you can chain all these things together. And you can have them fan out into different transforms and merge back together into a single sink, or feed two sinks from the same transform's output — all that kind of stuff. So basically we try to keep it very flexible. We definitely don't advertise ourselves as a general-purpose stream processor, but there's a lot of influence from working with those kinds of systems that has found its way into the design of Vector.
Tobias Macey
0:17:09
Yeah, the ability to map together different components of the overall flow is definitely useful. I've been using Fluentd for a while, which has some measure of that capability, but it's also somewhat constrained in that the logic of the actual pipeline flow is dependent on the order of specification in the configuration document, which makes it sometimes difficult to understand exactly how to structure the document to make sure that everything is functioning properly. There are some mechanisms for being able to route things slightly out of band with particular syntax, but just managing it has gotten to be somewhat complex. So when I was looking through the documentation for Vector, I appreciated the option of being able to simply say that the input to one of the steps is linked to the ID of one of the previous steps, so that you're not constrained by order of definition, and you can instead just use the ID references to ensure that the flows are what you intend.
Luke Steensen
0:18:12
Yeah, that was definitely something that we spent a lot of time thinking about, and we still spend a lot of time thinking about, because if you squint at these config files, they're kind of like a program that you're writing: you have data inputs, processing steps, and data outputs. So you want to make sure that that flow is clear to people, and you also want to make sure that there aren't going to be any surprises. A lot of tools, like you mentioned, have this as kind of an implicit part of the way the config is written, which can be difficult to manage. We wanted to make it as explicit as possible, but also in a way that is still relatively readable when you open up the config file. We've gone with a pretty simple TOML format, and then, like you mentioned, you just specify which input each component should draw from. We have had some ideas and discussions about what our own configuration file format would look like. What we would love to do eventually is make these kinds of pipelines as pleasant to write as something like a bash pipeline, which is another really powerful inspiration for us. Obviously they have their limitations, but the things that you can do just in a bash pipeline — you have a log file, you grep things out, you run it through something else — there's all kinds of really cool stuff that you can do in a lightweight way. And that's something that we've put some thought into: how can we be as close to that level of power and flexibility, while avoiding a lot of the limitations of, obviously, being a single tool on a single machine. I don't want to get into all the gotchas that come along with writing bash one-liners — obviously there are a lot — but it's something that we want to take as much of the good parts from as possible.
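To make the point about explicit `inputs` references concrete, here is a hedged sketch of a chained pipeline. Because each component names the component it reads from, the order of the TOML tables does not matter; the transform names and options (`json_parser`, `remove_fields`, `fields`) are assumptions and may not match every Vector version.

```toml
[sinks.out]                          # defined first, but reads from the last transform
type     = "console"
inputs   = ["scrub"]
encoding = "json"

[sources.in]
type = "stdin"

[transforms.parse]
type   = "json_parser"
inputs = ["in"]                      # parse each incoming line as JSON

[transforms.scrub]
type   = "remove_fields"
inputs = ["parse"]                   # drop a sensitive field before it leaves the box
fields = ["password"]
```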
Tobias Macey
0:20:33
And then in terms of your decision process for the actual runtime implementation for both the actual engine itself, as well as the scripting layer that you implemented in Lua? What was the decision process that went into that as far as choosing and settling on rust? And what were the overall considerations and requirements that you had as part of that decision process.
Luke Steensen
0:20:57
So from a high level, the things that we thought were most important when writing this tool — which is obviously going to run on other people's machines, and hopefully a lot of other people's machines — we want to be respectful of the fact that they're willing to put our tool on a lot of their boxes. So we don't want to use a lot of memory, we don't want to use a lot of CPU; we want to be as resource-constrained as possible. So efficiency was a big point for us, which Rust obviously gives you the ability to achieve. I'm a big fan of Rust, so I could probably talk for a long time about all the wonderful features, but honestly, the fact that it's a tool that lets you write very efficient programs and control your memory use pretty tightly is somewhere we have a pretty big advantage over a lot of other tools. And then I was the first engineer on the project, and I know Rust quite well, so the human aspect of it made sense for us too. We're lucky to have a couple of people at Timber who are very good with Rust and very familiar and involved in the community, so it has worked out very well. From the embedded scripting perspective, Lua was kind of an obvious first choice for us. There's very good precedent for using Lua in this manner — for example, in Nginx and HAProxy. They both have Lua environments that let you do a lot of amazing things that you would maybe never expect to be able to do with those tools: you write a little bit of Lua, and there you go, you're all set. So Lua is very much built for this purpose — it's built as an embedded language — and there's a mature implementation of bindings for Rust, so it didn't take a lot of work to integrate Lua, and we have a lot of confidence that it's a reasonably performant, reliable thing that we can drop in and expect to work. That being said, it's definitely not the end-all, be-all. While people can be familiar with Lua from a lot of different areas where it's used, like game development and, like I mentioned, some observability or infrastructure tools, we are interested in supporting more than just Lua. We actually have a work-in-progress JavaScript transform that will allow people to use that as an alternative engine for transformations. And then a little bit longer term — we want this area to mature a little bit before we dive in — the WebAssembly space has been super interesting, and I think that, particularly from a flexibility and performance perspective, it could give us a platform to do some really interesting things in the future.
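As an example of the embedded Lua scripting Luke describes, a transform can carry a short inline script. This is a sketch that assumes the transform is named `lua` and exposes each event as a mutable `event` table, which may not match the exact API of every version.

```toml
[transforms.annotate]
type   = "lua"
inputs = ["parse"]
source = """
-- add a static field and coerce an existing one on every event
event["environment"] = "production"
if event["status"] ~= nil then
  event["status"] = tonumber(event["status"])
end
"""
```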
Tobias Macey
0:24:06
Yeah, I definitely think that the WebAssembly area is an interesting space to keep an eye on, because it is in some ways being targeted as a sort of universal runtime that multiple different languages can compile to. And in terms of your choice of Rust, another benefit that it has, when you're discussing memory efficiency, is the guaranteed memory safety, which is certainly important when you're running in customer environments, because that way you're less likely to have memory leaks or accidentally crash their servers because of a bug in your implementation. So I definitely think that's a good choice as well. And then one other aspect of the choice of Rust for the implementation language that I'm curious about is how that has impacted the overall uptake of users who are looking to contribute to the project — either because they're interested in learning Rust, or in terms of people who aren't necessarily familiar with Rust, and any barriers that may pose.
Luke Steensen
0:25:02
It's something that's kind of hard to know, because obviously we can't inspect the alternate timeline where we wrote it in Go or something like that. I would say that there are ups and downs from a lot of different perspectives. From a developer interest perspective, I think Rust is something that a lot of people find interesting now — the sales pitch is a good one, and a lot of people find it compelling — so it's definitely caught a few more people's interest because it happens to be written in Rust. We try not to push on that too hard, because of course there's the other set of people who do not like Rust and are very tired of hearing about it. So we love it, and we're very happy with it, but we try not to make it a primary marketing point or anything like that. But I think it does help to some degree. And then from a contributions perspective, again, it's hard to say for sure, but I do know from experience that we have had a handful of people pop up from the open source community and give us some really high-quality contributions, and we've been really happy with that. Like I said, we can't really know how that would compare to if we had written it in a language that more people are proficient in, but the contributions from the community that we have seen so far have been really high quality, and we're really happy with them. The JavaScript transform that I mentioned is actually a good example of that: we had a contributor come in and do a ton of really great work to make that a reality, and it's something that we're pretty close to being able to merge and ship. So I definitely shared a little bit of that concern — I know Rust at least has the reputation of being a more difficult language to learn — but the community is there; there are a lot of really skilled developers who are interested in Rust and would love to have an open source project like Vector that they can contribute to, and we've definitely seen a lot of benefit from that.
Tobias Macey
0:27:12
In terms of the internals of Vector, I'm curious how the data itself is represented once it is ingested from a source, and how you process it through the transforms — whether there's a particular data format that you use internally in memory, and also any capability for schema enforcement as it flows through Vector out to the sinks.
Luke Steensen
0:27:39
Yeah, so right now we have our own internal, in-memory data format. It's a little bit — I don't want to say thrown together, but it's something that's been incrementally evolving pretty rapidly as we build up the number of different sources and things that we support. This was actually something that we deliberately set out to do when we were building Vector: we didn't want to start with the data model. There are some projects that do that, and I think there's definitely a space for that — the data modeling in the observability space is not always the best — but we explicitly wanted to leave that to other people. We were going to start with the simplest possible thing and then add features as we found we needed them in order to better support the data models of the downstream sinks and the transforms that we want to be able to do. So from day one, the data model was basically just a string: you send us a log message, and we represent it as a string of characters. Obviously it's grown a lot since then. We now support what we call events internally — that's our vague name for everything that flows through the system. Events can be a log or a metric; for metrics, we support a number of different types, including counters and gauges, all your standard types of metrics from the StatsD family of tools. And then logs can be just a message — like I said, just a string; we still support that as much as we ever have — but we also support more structured data. So right now it's a flat map of string to something, with a variety of different types that the values can be, and that's also something that's growing as we want to better support different tools. So right now it's a non-nested, JSON-ish representation. In memory we don't actually serialize it to JSON, and we support a few extra types, like timestamps, that are important for our use case, but in general that's how you can think about it. We have a Protocol Buffers schema for that data format that we use when we serialize to disk for some of our on-disk buffering, but I wouldn't say that's the primary representation. When you work with it in a transform, you're looking at that in-memory representation that, like I said, looks a lot like JSON. And that's something that we're constantly re-evaluating and thinking about how to evolve. I think the next step in that evolution is to make it not just a flat map and move it towards supporting nested maps and nested keys, so it's going to move more towards an actual full JSON, with better types and support for things like that.
Tobias Macey
0:30:39
And on the reliability front, you mentioned briefly the idea of disk buffering, and that's definitely something that is necessary for the case where you need to restart the service and you don't want to lose messages that have been sent to an aggregator node, for instance. I'm curious what some of the overall capabilities in Vector are that support this reliability objective, and also, in terms of things such as malformed messages when you're trying to enforce a schema, whether there's any way of putting those into a dead letter queue for reprocessing, or anything along those lines.
Luke Steensen
0:31:14
Yeah, a dead letter queue specifically isn't something that we support at the moment. It is something that we've been thinking about, and we do want to come up with a good way to support it, but currently that isn't something that we have. A lot of transforms, like the schema enforcement transform, will end up just dropping the events that don't conform — if it can't enforce that they meet the schema by dropping fields, for example, it will just drop the event — and we recognize the shortcomings there. I think one of the things that is a little bit nice, from an implementation perspective, about working in the observability space, as opposed to the normal data streaming world with application data, is that there's more of an expectation of best effort, which is something that we're willing to take advantage of a little bit in the very early stages of a certain feature or tool. But it's definitely a part of the ecosystem that we want to push forward, so that's something we try to keep in mind as we build all this stuff: yes, it might be okay for now — we may have parity with other tools if, for example, we just drop messages that don't meet a certain schema — but how can we do better than that? There are other things in the toolbox that you can reach for for this type of problem. The most basic one would be that you can send data to multiple places. If you have a classic syslog-like setup where you're forwarding logs around, it's super simple to just add a secondary sink that will forward to both syslog aggregator A and syslog aggregator B. That's nothing particularly groundbreaking, but it's a start. Beyond that, I mentioned the disk buffer, where we want to do as good a job as we can ensuring that we don't lose your data once you have sent it to us. We are still a single-node tool at this point — we're not a distributed storage system — so there are going to be some inherent limitations in the guarantees that we can provide there. If you really want to make sure that you're not losing any data at all, Vector is not going to be able to give you the guarantees that something like Kafka would, so we want to make sure that we work well with tools like Kafka that are going to give you pretty solid, redundant, reliable, distributed storage guarantees. Other than those two, writing the tool in Rust is an indirect way that we try to make it as reliable as possible. Rust has a bit of a reputation for making it tricky to do things — the compiler is very picky and wants to make sure that everything you're doing is safe — and that's something you can take advantage of to guide you in writing, as you mentioned, memory-safe code. But it expands beyond that into ensuring that every error that pops up is something you're handling explicitly at that level or a level above. It guides you into writing more reliable code by default. Obviously it's still on you to make sure that you're covering all the cases, but it definitely helps.
And then moving forward, we're going to spend a lot of time in the very near future setting up certain kinds of internal torture environments, if you will, where we can run Vector for long periods of time and induce failures in the network and the upstream services, maybe delete some data from underneath it on disk, and that kind of thing — similar in spirit to the Jepsen suite of database testing tools, if you're familiar with those. Obviously we don't have quite the same types of invariants that an actual database would have, but I think we can use a lot of those techniques to stress Vector and see how it responds. And like I said, we're going to be limited in what we can do based on the fact that we're a single-node system — if you're sending us data over UDP, there's not a ton of guarantees that we're going to be able to give you — but to the degree that we're able to give guarantees, we very much would like to do that. So that's definitely a focus of ours, and we're going to be exploring it as much as possible.
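The disk buffering Luke refers to above is configured per sink. A hedged sketch follows; the option names (`buffer.type`, `max_size`, `when_full`) are assumptions based on Vector's documentation, and the size shown is arbitrary.

```toml
[sinks.search]
type   = "elasticsearch"
inputs = ["parse"]
host   = "http://localhost:9200"

  [sinks.search.buffer]
  type      = "disk"           # spill events to disk so a restart does not drop them
  max_size  = 104900000        # bytes to buffer before applying back-pressure
  when_full = "block"          # block upstream instead of dropping new events
```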
Tobias Macey
0:36:03
And then, in terms of the deployment topologies that are available, you mentioned one situation where you're forwarding to a Kafka topic, but I'm curious what other options there are for ensuring high availability and the overall uptime of the system for being able to deliver messages or events or data from the source to the various destinations.
Luke Steensen
0:36:28
Yeah, there are a few different general topology patterns that we've documented and that we recommend to people. The simplest one, depending on how your infrastructure is set up, can just be to run Vector on each of your application servers, or whatever it is that you have, and run them there in a very distributed manner, forwarding to an upstream logging service or something like that, so you don't necessarily have any aggregation happening in your infrastructure. That's pretty easy to get started with, but it does have limitations — if you don't want to allow outbound internet access from each of your nodes, for example. The next step, like you mentioned, is that we support a second tier of Vector, running maybe on a dedicated box, where you have a number of nodes forward to a more centralized aggregator node, and then that node forwards to whatever other sinks you have in mind. That's the second level of complexity, I would say. You do get some benefits in that you have more power to do things like aggregations and sampling in a centralized manner — there are certain things that you can't do if you never bring the data together. And especially if you're looking to reduce cost, it's nice to have that aggregator node as a leverage point where you can bring everything together, evaluate what is most important for you to forward to different places, and do that there. And then, for people who want more reliability than a single aggregation node, at this point our recommendation is something like Kafka, which is going to give you distributed, durable storage. That is a big jump in terms of infrastructure complexity, so there's definitely room for some in-betweens that we're working on in terms of having a failover option. Right now you could put a couple of aggregator nodes behind a TCP load balancer or something like that, but that's not necessarily going to be the best experience. So we're investigating our options there to try to give people a good range of choices for how much they're willing to invest in the infrastructure, and what kind of reliability and robustness benefits they need.
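A sketch of the agent-and-aggregator topology Luke outlines, assuming Vector provides `vector`-type source and sink components for node-to-node forwarding and a `kafka` sink; the names, options, and addresses are illustrative assumptions rather than details from the episode.

```toml
# On each application node: forward everything to the central aggregator.
[sources.local_logs]
type    = "file"
include = ["/var/log/myapp/*.log"]

[sinks.to_aggregator]
type    = "vector"
inputs  = ["local_logs"]
address = "aggregator.internal:9000"

# On the dedicated aggregator box (a separate config file):
# [sources.from_agents]
# type    = "vector"
# address = "0.0.0.0:9000"
#
# [sinks.kafka_out]
# type              = "kafka"
# inputs            = ["from_agents"]
# bootstrap_servers = "kafka-1:9092"
# topic             = "logs"
```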
Tobias Macey
0:39:19
Another aspect of the operational characteristics of the system is being able to have visibility, particularly at the aggregator level, into what the current status of the buffering is, any errors that are cropping up, and the overall system capacity. I'm curious if there's any current capability for that, or what the future plans are along those lines.
Luke Steensen
0:39:44
Um, yeah, we have a setup for our own internal metrics at this point. That's another thing that, alongside the reliability work you mentioned, we're looking at very closely right now as what we want to do next. The way we've set ourselves up, we have an event-based system internally, where events can be emitted normally as log events, but we also have the means to send them through something like a Vector pipeline, where we can do aggregations and filter and sample to get better insight into what's happening in the process. We haven't gotten as far as I'd like down that road at this point, but I think we have a pretty solid foundation to do some interesting things, and it's definitely going to be a point of focus in the next few weeks.
Tobias Macey
0:40:50
So in terms of the overall roadmap, you've got a fairly detailed set of features and capabilities that you're looking to implement. And I'm wondering what your decision process was in terms of the priority ordering of those features, and how you identified what the necessary set was for a one dot o release.
Ben Johnson
0:41:12
So initially, when we planned out the project, our roadmap was largely influenced by our past experiences — not only supporting Timber customers, but running our own observability tooling. And just based on the previous questions you asked, it was obvious to us that we would need to support those different types of deployment models, so a lot of the roadmap was dictated by that. You can see, later on the roadmap, that we want to support stream processors so we can enable that sort of deployment topology. It's very much evolving, though, as we learn and collect data from customers and their use cases — we're actually going to make some changes to it. And in terms of a 1.0 release, everything that you see on the roadmap on GitHub, which is represented as milestones, we think represents what 1.0 means for us: something a reasonably sized company could deploy and rely on Vector for. And again, given our experience, a lot of that is dependent on Kafka, or some sort of more complex topology, as it relates to collecting your data and routing it downstream.
Tobias Macey
0:42:38
And then, in terms of the current state of the system, how would you characterize its overall production readiness, and are there any features that are currently missing that you think would be necessary for a medium to large scale company to be able to adopt it readily?
Ben Johnson
0:42:55
Yeah. So in terms of a 1.0 release, where we would recommend it for very stringent production use cases: I think what Luke just talked about with internal metrics — it's really important that we improve Vector's own internal observability and provide operators the tools necessary to monitor performance, set up alarms, and make sure that they have confidence in Vector. Internally, the stress testing is also something that would raise our confidence; we have a lot of interesting stress-testing use cases that we want to run Vector through, and I think that'll expose some problems, so getting that done would definitely raise our confidence. And then I think there's just some general housecleaning that would be helpful for a 1.0 release. The initial stages of this project have been inherently a little messier, because we are building out the foundation and moving pretty quickly with our integrations. I would like to see that settle down more when we get to 1.0, so that we have smaller incremental releases, and we take breaking changes incredibly seriously — Vector's reliability and least-surprise philosophy definitely plays into how we release the software and make sure that we aren't shipping a minor update that actually has breaking changes in it, for example. So I would say those are the main things missing before we can officially call it 1.0. Outside of that, the one other thing that we want to do is provide more education on some high-level use cases around Vector. I think right now the documentation is very good, in that it dives deep into all the different components — sources, sinks, and transforms — and all the options available, but I think we're lacking in guidance around how you deploy Vector in an AWS environment or a GCP environment. That's certainly not needed for 1.0, but I think it is one of the big missing pieces that will make Vector more of a joy to use.
Tobias Macey
0:44:55
in terms of the integrations, what are some of the ways that people can add new capabilities to the system? Does it require being compiled into the static binary? Or are there other integration points where somebody can add a plug in. And then also, in terms of just use of the system, I'm curious what options there are as far as being able to test out a configuration to make sure that the content flow is what you're expecting.
Luke Steensen
0:45:21
So in terms of plugins, basically, we don't have a strong concept of that right now. All of the transforms that I've mentioned, and the sources and sinks, are written in Rust and natively compiled into the system. That has a lot of benefits, obviously, in terms of performance, and we get to make sure that everything fits together perfectly ahead of time, but obviously it's not quite as extensible as we'd like at this point. So there are a number of strategies that we've thought about for allowing more user-provided plugins — I know that's a big feature of Fluentd that people get a lot of use out of, so it is something that we'd like to support. But we want to be careful how we do it, because we don't want to give up our core strengths, which I'd say are the performance, robustness, and reliability of the system. We want to be careful how we expose those extension points, to make sure that the system as a whole maintains the properties that we find most valuable.
Ben Johnson
0:46:29
Yeah, and to echo Luke on that: we've seen that plugin ecosystems are incredibly valuable, but they can be very dangerous — they can ruin a project's reputation as it relates to reliability and performance. We've seen that firsthand with a lot of the different interesting Fluentd setups that we've seen with our customers. They'll use off-the-shelf plugins that aren't necessarily written properly or maintained actively, and it just adds a variable to the discussion of running the collector that makes it very hard to ensure it's meeting the reliability and performance standards that we want to meet. So I would say that if we do introduce a plugin system, it'll be quite a bit different than what people are expecting — that's something we're putting a lot of thought into. And, to go back to some of the things you said before, we've had community contributions, and they're very high quality, but those still go through a code review process that exposes quite a few fundamental differences and issues in the code that would otherwise not have been caught. So it's an interesting conundrum to be in: on the one hand, we like that process because it ensures quality; on the other, it is a blocker in certain use cases.
Luke Steensen
0:47:48
Yeah, I think our strategy there so far has basically been to allow programmability in limited places — for example, the Lua transform and the upcoming JavaScript transform. There's a surprising amount that you can do even when you're limited to the context of a single transformation. We are interested in extending that: would it make sense to have a sink or a source where you could write a lot of the logic in something like Lua or JavaScript or a language compiled to WebAssembly, and then we provide almost a standard library of I/O functions and things like that, which we would have more control over and could use to ensure, like Ben said, the performance and reliability of the system. And then the final thing is that we really want Vector to be as easy to contribute to as possible. Ben mentioned some housekeeping things that we want to do before we really consider it 1.0, and I think a lot of that is around extracting common patterns for things like sources, sinks, and transforms into our internal libraries, so that if you want to come in and contribute support to Vector for a new downstream database or metrics service or something like that, the process is as simple as possible. We want you to be guided into the right path in terms of handling your errors and doing retries by default and all of that — we want it to be right there and very easy, so that we can minimize the barrier. There's always going to be a barrier if you have to write a pull request to get support for something, as opposed to just writing a plugin, but we want to minimize it as much as we possibly can.
Tobias Macey
0:49:38
And there are a whole bunch of other aspects of this project that we haven't touched on yet that I have gone through in the documentation that I think is interesting, but I'm wondering if there is anything in particular that either of you would like to discuss further before we start to close out the show.
Ben Johnson
0:49:55
In terms of the actual technical implementation of Vector, I think one of the unique things worth mentioning is that Vector intends to be the single data collector across all of your different types of data. We think that's a big gap in the industry right now — that you need different tools for metrics, logs, exceptions, and traces — and we think we can really simplify that. That's one of the things we didn't touch on very well in this podcast, but right now we support logs and metrics, and we're considering expanding support to other types of observability data, so that you can claim full ownership and control of the collection and routing of that data.
Luke Steensen
0:50:36
Yeah, I mean, there are small technical things within Vector that I think would be neat to talk about for a little while, but for me, the most interesting part of the project is viewing it through the lens of being a platform that you program — something that is as flexible and programmable as possible, in the vein of those bash one-liners that I talked about. That's something that can be a lot of fun and can be very productive. And the challenge of lifting the thing that you do in the small, on your own computer or on a single server, up to a distributed context — I find that a really interesting challenge, and there are a lot of fun little pieces that you get to put together as you try to move in that direction.
Tobias Macey
0:51:28
Well, I'm definitely going to be keeping an eye on this project. And for anybody who wants to follow along with you or get in touch and keep track of the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Luke Steensen
0:51:49
For me, I think there are so many interesting stream processing systems, databases, tools, and things like that, but there hasn't been quite as much attention paid to the glue: how do you get your data in, and how do you integrate these things together? That ends up being a big barrier to getting people into these tools and getting a lot of value out of them — there's a really high activation energy, and it's often assumed that you're already bought into a given ecosystem. That's the biggest thing for me: for a lot of people and a lot of companies, it takes a lot of engineering effort to get to the point where you can do interesting things with these tools.
Ben Johnson
0:52:33
And as an extension of that, it doesn't just apply to the collection side; it goes all the way to the analysis side as well. We want to help empower users to accomplish that and take ownership of their data and their observability strategy. Vector is the first project that we're launching in that initiative, but we think it goes all the way across. And so, to echo Luke, that is the biggest thing, because so many people get so frustrated with it that they just throw their hands up and hand their money over to a vendor — which is fine in a lot of use cases, but it's not empowering. And there's no silver bullet: there's no one storage or one vendor that is going to do everything amazingly. So at the end of the day, being able to take advantage of all the different technology and tools and combine them into a cohesive observability strategy, in a way that is flexible and able to evolve with the times, is the holy grail. That's what we want to enable, and we think that process needs quite a bit of improvement.
Tobias Macey
0:53:43
I appreciate the both of you taking the time today to join me and discuss the work that you're doing on Vector and Timber. It's definitely a very interesting project, and one that I hope to be able to make use of soon to facilitate some of my overall data collection efforts. So I appreciate all of your time and effort on that, and I hope you enjoy the rest of your day.
Ben Johnson
0:54:03
Thank you. Yeah, and just to add to that: if anyone listening wants to get involved or ask questions, there's a community link on the repo itself. You can chat with us — we want to be really transparent and open, and we're always welcoming conversations around the things we're doing. Yeah, definitely.
Luke Steensen
0:54:22
Just want to emphasize everything Ben said. And thanks so much for having us.
Ben Johnson
0:54:26
Yeah, thank you.
Tobias Macey
0:54:32
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building A Community For Data Professionals at Data Council - Episode 96

Summary

Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was your original reason for focusing your efforts on fostering a community of data engineers?
    • What was the state of recognition in the industry for that role at the time that you began your efforts?
  • The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?
    • How has the community itself changed and grown over the past few years?
  • Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?
  • Where do you draw inspiration and direction for how to manage such a large and distributed community?
    • What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?
  • What are some ways that you have been surprised or delighted in your interactions with the data community?
  • How do you approach sustainability of the Data Council community and the organization itself?
  • The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?
  • In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?
  • You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?
    • What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?
  • What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?
    • What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?
  • What are the characteristics of a data business that you look at when evaluating a potential investment?
  • What are some of the current industry trends that you are most excited by?
    • What are some that you find concerning?
  • What are your goals and plans for the future of Data Council?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies. And full disclosure, Data Council and Clubhouse are both previous sponsors of the podcast, and Clubhouse is one of the companies that Pete has invested in. So Pete, can you just start by introducing yourself?
Pete Soderling
0:02:44
Yeah, thanks. Thanks for the opportunity to be here, Tobias. I'm Pete Soderling, as you mentioned, and I'm a serial entrepreneur. I'm also a software engineer from the first internet bubble, and I'm super passionate about making the world a better place for other developers.
Tobias Macey
0:02:59
And do you remember how you first got involved in the area of data management?
Pete Soderling
0:03:02
Yeah, funnily enough, the thing that jumps out at me is how excited I was when, as an early developer, very young in my career, I discovered the book Database Design for Mere Mortals. I think I read it over my holiday vacation one year, and I was sort of amazed at myself at how interesting such a potentially dry topic could be. So that was an early indicator. Fast forward: my first company, actually my second company, that I started in 2009 was called Strata Security. Originally we were building what we thought was an API firewall for web based APIs, but it quickly morphed into a platform that secured and offered premium data from providers like Bloomberg or Garmin, or companies that had lots of interesting proprietary data sets. Our vision was to become essentially the electricity between that data provider and their API and the consumers who were consuming that data through the API, so that we could offer basically metered billing based on how much data was consumed. So I guess that was my first significant interest as an entrepreneur in the data space, back about 10 years or so ago.
Tobias Macey
0:04:20
And now you have become an investor in data oriented companies. You've also started the series of conferences that were previously known as Data Eng Conf and have been rebranded as Data Council, and that all started with your work in founding Hakka Labs as a community space for people working in the data engineering area. I'm curious what your original reason was for focusing your efforts in that direction, and for focusing particularly on data engineers.
Pete Soderling
0:04:51
Yeah, I guess it gets a bit to the core of who I am. As I've looked back over my shoulder, as both an engineer and a father, what I've realized, which actually to some extent surprised me, is that all of the companies I've started have had software engineers as end users or customers. And I discovered that I really am significantly passionate about getting developers together, helping them share knowledge, helping them with better tooling, and essentially just making the world awesome for them. It's become a core mission of everything I do, and I think it's infused in all the different opportunities that I'm pursuing now. For instance, one of my goals is to help 1000 engineers start companies, but that gets to some of the startup stuff, which is essentially a side project that we can talk about later. Specifically as it relates to Hakka Labs: Hakka Labs was originally started in 2010 to become a community for software engineers. Originally we thought that we were solving the engineer recruiting problem, so we had various ideas and products that we tested around introducing engineers to companies in a trusted way. That was largely for all software engineers everywhere, and our plan was to turn it into a digital platform. It was going to have social network dynamics where we were connecting engineers together, and those engineers would help each other find the best jobs. So it very much was in the social networking world. But one of our marketing ideas was to build a series of events surrounding the digital platform, so that we could essentially lead users from the community and our events into the digital platform, which was the main goal. And one of the niche areas that we wanted to focus on was data, because data science was becoming a hot thing and data engineering was even more nascent. But I sort of saw the writing on the wall and saw that data engineering was going to be required to support all the interesting data science goodness that was being talked about, and that it really was of interest to business. And so the data meetups that we started were essentially a marketing channel for this other startup that I was working on. Ultimately that startup didn't work and the product didn't succeed, which is often the case with network based products, especially in the digital world. But I realized that we had built this brand surrounding awesome content for software engineers and data engineers through the meetups that we had started. We fell back on that community and figured out that there must be another way to monetize and to keep the business going, because I was so passionate about working with engineers and continuing to build this community that we had seeded. And engineers loved the brand, they loved the events that we were running, they loved our commitment to deeply technical content. And so one thing led to another, and ultimately those meetups grew into what Data Council is today.
Tobias Macey
0:07:48
And you mentioned that when you first began this journey it was in the 2010 timeframe, and as you said, data engineering as a discipline was not very widely recognized. I'm wondering how you have seen the overall evolution of that role and responsibility, and what the state of the industry was, and what the types of recognition and responsibilities were for data engineers at that time.
Pete Soderling
0:08:16
Yeah, you know, data engineering was just not really a thing at the time, and only the largest tech companies, Google and Facebook, even had the notion of the data engineering concept. But I guess what I've learned from watching engineering at those companies is that, because of their scale, they discover things more quickly and earlier than other folks tend to. So I think that was just an interesting leading indicator, and I felt like it was going to get bigger and bigger. But I don't even know if Google necessarily had a data engineering job title at that time. So that was just very early in the space, and I think we've seen it develop a lot since then, and it's not just in the title. We saw early on that data science was a thing and was going to be a bigger thing, and data engineering was required to have the data scientists and the quants do awesome stuff in the first place. And then there are also the analysts who are trying to consume the data sets, oftentimes in slightly different ways than the data scientists. So early on we saw that these three types of roles were super fundamental and foundational to building the future of data infrastructure and business insights and data driven products. And so even though we started off running the data engineering meetup, which I think we're still known for, we pretty quickly, through the conference, embraced these other two disciplines as well, because the interplay of how these types of people work together inside organizations is where the really interesting stuff is. The themes in these job descriptions, and how they unite and how they work together on projects, is fascinating. Through Data Council, our goal has been to further the industry by getting these folks in the same room around content that they all care about. Sometimes it's teaching a data engineer a little bit more about data science, because that's what makes them stronger and better able to operate on a multifunctional team, and sometimes it's the data scientists getting a little bit better at some of the engineering disciplines and understanding more of what the data engineers have to go through. So I think that looking at this world in a cohesive way, especially across these three roles, has really benefited us and made the community and the event very unique and strong. And I should say that the next phase of that, in terms of team organization and especially in terms of our vision with Data Council, is that we're now embracing product managers into that group as well. We sort of see this stack of data: data infrastructure on the bottom, then data engineering and pipelines, then data science and algorithms, then data analytics on top of that, and finally the AI features and the people that weaponize this entire stack into AI products and data driven features. And I think the final icing on the cake is creating space for data oriented product managers, because it used to be that you might think of a data product manager as working for Oracle or being in charge of shipping a database, but that's a bit older school at this point.
And there's all kinds of other interesting applications of data infrastructure and data technologies that are not directly in the database world, where real world product managers in the modern era need to understand how to interact with this stack and how to build data tooling, whether it's internal tooling for developers or customer and consumer facing products. So I think embracing the product manager at the top of this stack has been super helpful for our community as well.
Tobias Macey
0:12:07
And I find it interesting that you are bringing in the product manager, because there has long been a division, particularly within technical disciplines, where you have historically the software engineer who is at odds with the systems administrator, and more recently the data scientist or data analyst who is at odds with the data engineer. But there has been an increasing alignment around the business case, and less so in terms of the technical divisions. I'm curious what your view is on how the product manager fits in that overall shift in perspective, and what their responsibility is within an organizational structure to help with the overall alignment of requirements and vision and product goals between those different technical disciplines.
Pete Soderling
0:12:59
Yeah, well, hey, I think this question is a microcosm of what's really happening in the engineering world, because I think software engineers, at the core, at the central location, are actually eating disciplines and roles that used to be sort of beneath them and above them. So again, I'm sticking with thinking in terms of this vertical stack. Most modern tech companies don't have DBAs, because the software engineer is now the DBA, and many companies don't have designated infrastructure teams, because a software engineer is responsible for their own infrastructure. Some of that is because of cloud or other dynamics, but what's happening, at its core, is that the engineer is eating the world. It's bigger than just software eating the world; engineers are eating the world. So I think the absorption of some of these older roles into what's now considered core software engineering has happened below, and I think it's happening above. Some product management is collapsing into the world of the software engineer, or engineers are sort of laddering up into product management, and I think part of that is the nature of these deeply technical products that we're building. So I think many engineers make awesome product managers. Perhaps they have to step away and be willing not to write as much code anymore, but because they have an intrinsic understanding of the way these systems are built, engineers are just sort of reaching out and absorbing a lot of these other roles. Some of the best product managers that we've seen have been ex software engineers. So I just think there's a real merging into the software engineering related disciplines; that's a larger perception that I have of the world. And I think it's actually not a far leap to see how a product manager who's informed by an engineering discipline is super effective in that role. So this is a broader story that we're seeing overall, if that makes sense.
Tobias Macey
0:15:02
Yeah, I definitely agree that there has been a much broader definition of what the responsibilities are for any given engineer, because of the fact that there are a lot more abstractions for us to build off of. That empowers engineers to actually have a much greater impact with a similar amount of effort, because they don't have to worry about everything from the silicon up to the presentation layer; there are so many useful pre built capabilities that they can take advantage of, so they can think more in terms of the systems rather than the individual bits and bytes. Yeah, exactly. And in terms of your overall efforts for community building and community management, there are a number of different focal points for communities to grow around that come from different shared interests or shared history. So there are programming language communities, and there are communities focused on disciplines such as front end development or product management or business. I'm wondering what your experience has been in terms of how to orient a community focused along the axis of data, given that it can in some ways be fairly nebulous as to what the common principles in data are, because there are so many different ways to come at it and so many different formats that data can take.
Pete Soderling
0:16:31
Yeah, I think one of the core values for us, and I don't know if this is necessarily specific to data or not, is just openness. We see ourselves as much, much more than just a conference series, and we use the word community in our team and at our events to describe what we're doing dozens and dozens of times a week. So the community bond and that mentality for our brand is super high. There's also an open source sort of commitment, and I think that's a mentality, a coding discipline, and a style. Sharing knowledge is just super important in any of these technical fields, and engineers are super thirsty for knowledge. We see ourselves as being a connecting point where engineers and data scientists can come and share knowledge with each other. Maybe that's a little bit accelerated in the case of data science or AI research, because these things are changing so fast, and so there is a bit of an accelerant in the way that we're able to see our community grow and the interest in this space, because so much of this technical stuff is changing quickly. Engineers need a trusted source they can come to where they can find, and have surfaced for them, the best, most interesting research and most interesting open source tools. So we've capitalized on that, and we try to be that. On one hand we're sort of a media company, on the other hand we're an events business, on the other hand we're a community, but we're putting all these things together in a way that we think benefits careers for engineers: it enables them to level up in their careers, makes them smarter, and helps them get better jobs and meet awesome people. So all in all, we see ourselves as building this awesome talent community around data and AI worldwide, and we think we're in a super unique position to do that and succeed at it.
Tobias Macey
0:18:33
Community building can be fairly challenging because of the fact that you have so many different disparate experiences and opinions coming together. Sometimes it can work out great, and sometimes you can have issues that come up just due to natural tensions between people interacting in a similar space. I'm wondering what you have been drawing on for inspiration and reference in how to foster and grow a healthy community, and any interesting or challenging or unexpected aspects of that overall experience of community management that you've encountered in the process.
Pete Soderling
0:19:13
Yeah, I think it's an awesome question, because any company that embraces community to some degree embraces somewhat of a wild west. Some companies and brands manage that very heavily top down, and they want to, and they have the resources to, and they're able to. Some others, I think, let the community sprawl. In our particular case, because we're a tiny startup, I used to say that we're three people and two PayPal accounts running events all over the world, and even though we're just a touch bigger than that now, not much, we have 18 meetups all over the world and are running conferences from San Francisco to Singapore. And just to be clear, we're a for profit business, but I think that's one of the other things that makes us super unique: yes, we're for profit, but at the same time we're embracing a highly principled notion of community, and we use lots and lots of volunteers in order to help further that message worldwide, because we can't afford to hire community managers in every single city that we want to operate in. So that's one thing. And I guess for us, we've just had to embrace the wild nature of what it means to scale a community worldwide and deal with the ups and downs and challenges that come with that. Of course there's some brand risk, and there are other sorts of frustrations sometimes working with volunteers. But my inspiration in this was really through 500 Startups. I went on Geeks on a Plane back in 2012, I believe, and I saw the way that 500 Startups, which is a startup accelerator in San Francisco, was building community all around the world, basically one plane at a time. I saw how kind of wild and crazy that community was, and I sort of learned the opportunity and the challenge of building community that way. And I think the main thing, if you can embrace the chaos, and if your business situation forces you to embrace the chaos in order to scale, is that you keep things in line by having a few really big core values that you talk about and emphasize a lot, because basically the community has to manage itself against those values. And this isn't a detailed, heavy, top-down process, because you just can't do that in this scenario. So the most important thing is that the community understands the ethos of what you stand for. That's why, with Data Council, there are a couple of things. I already mentioned open source; that's super important to us, and we're always looking for new ways to lift up open source and to encourage engineers to submit their open source projects for us to promote. We prioritize open source talks at our conference. That's just one thing. I think the other thing for Data Council is that we've committed to not being an over sponsored brand. This can make it hard economically for us to grow and to hire the team that we want to sometimes, but we're very careful about the way we integrate partners and sponsors into our events, and we don't want to be what we see some of our competitors being: sort of over saturated and swarming with salespeople.
I guess the other thing that's super important for us is that we're just deeply, deeply committed to deeply technical content. We screen all of our talks, and we're uncompromising in the speakers that we put on stage. And I think all of these things resonate with engineers. I'm an engineer, so I know how engineers think, and I think these three things have differentiated us from a lot of the other conferences and events out there; we realized that there was space for this differentiation. And now it makes engineers and data scientists around the world want to raise their hands and help us launch meetups. We were in Singapore last month, where we launched our first Data Council conference, which was amazing. The whole community came; between Australia and India and the whole Southeast Asia region, they were there. And we left with three or four new meetups in different cities, because people came to the conference, saw what we stood for, saw they were sitting next to awesome people and awesome engineers, and it wasn't just a bunch of suits at a data conference. They wanted to go home and take a slice of Data Council back to their city. And so we support them in creating meetups, we connect them to our community, and we help them find speakers. It's just been amazing to see that thrive. And like I said, the main thing is just knowing the core ethos of what you stand for, and even in the crazy times, being consistent about the way you communicate that to the community, letting the community run with it, and seeing what happens. Sometimes it's messy, and sometimes it's awesome, but it's an awesome experiment. I just think it's incredible that a small company like us can have global reach, and it's only because of the awesome volunteers, community organizers, meetup organizers, and track hosts for our conference that we've been able to pull into this orbit. We just want to make the world a better place for them, and they've been super helpful and kind in supporting us, and we couldn't have done it without them. So it's been an awesome experiment, and we're continuing to push forward with that model.
Tobias Macey
0:24:33
With so many different participants and the global nature of the community that you're engaging with, there's a lot of opportunity for different surprises to happen. I'm wondering what have been some of the most interesting or exciting, to paraphrase Bob Ross, happy accidents that you have encountered in your overall engagement with the community?
Pete Soderling
0:24:57
Hmm, I guess this wasn't totally surprising, but I just love to surround myself with geeks. Geeks have always been my people, and even when I stopped writing code actively, I just gravitated towards software engineers, which is obviously why I do what I do; it's what makes me tick. I guess one of the interesting things about running a conference like this is that you get to meet such awesome startups, and there are so many incredible startups being started outside of the Valley. I lived in New York City for many years, and I lived in San Francisco for many years, and now I spend most of my time in Wyoming. So I'm relatively off the map in one way of thinking, but in the other way, as the center of this conference, we just meet so many awesome engineers and startups all over the globe, and I'm really happy to see such awesome ideas start to spring up from other startup ecosystems. I don't believe that all the engineering talent should be focused in Silicon Valley, even though it's easy to go there, learn a ton, and really benefit from the community and from the big companies with scale. Ultimately, not everyone is going to live in the Bay Area, and I hope they don't, because it's already getting a little messy there. I just want to see all of these things democratized and distributed, both in terms of software engineering and in terms of the engineers that start these awesome startups. So the ease with which I'm able to meet and see new startups around the globe through the Data Council community has been a real bright spot. I don't know if it was necessarily a surprise, but maybe it's been a surprise to me how quickly it's happened.
Tobias Macey
0:26:34
So one of the other components to managing communities is the idea of sustainability and the overall approach to governance and management. And I'm wondering both for the larger community aspect, as well as for the conferences and meetup events, how you approach sustainability to ensure that there is some longevity and continuity in the overall growth and interaction with the community?
Pete Soderling
0:27:02
Yeah, I think the main thing, and this gets back to another core tenet of the psychographic of a software engineer, is that software engineers need to know how things work. That's at the core of our mentality in building things: we want to know how things work, and if we didn't build it ourselves, we prefer to rip off the covers and understand how it works. And to be honest, that's part of the way that we, for instance, select talks at our conference, and it's something we're learning to get better about. As a value, we believe in openness and transparency in our company, and externally facing, we're getting better about how we actually enable that with the community. For instance, for our next Data Council conference that's coming up in New York in November, we've published all of our track summaries on GitHub, and we've opened that up to the community, where they can contribute ideas, questions, maybe even speakers, themes, sub-themes, et cetera. And I think just the fact that we have the culture to start to plan our events like this in the open brings a lot more transparency. And then I guess the other thing about a community, inherent in a well run community, is the amount of diversity you get. Obviously we're all aware that software engineering as a discipline is suffering from a shortage of diversity in certain areas, and I think as we commit to that locally, regionally, and globally, there are so many types of diversity that we get through the event. So I think both of these things are super meaningful in keeping the momentum of the community moving forward, because we want to continue to grow, and we want to continue to grow by welcoming folks that maybe didn't previously identify with the data engineering space into the community, so that they can see what it's like and evaluate whether they want to take a run at that in their career. So all these things, transparency, openness, diversity, are hallmarks of a great community, and these are the engines that keep the community going and moving forward, sometimes in spite of the resources, or the lack of resources, that a company like Data Council itself can muster at any one time.
Tobias Macey
0:29:22
In terms of the conference itself, the tagline that you have focused on, and we've talked a little bit about this already, is that they are no fluff, paraphrasing your specific wording, as a way of juxtaposing them against some of the larger events that are a little bit more business oriented, not calling out any specific names. I'm wondering what your guidelines are for fulfilling that promise and why you think it is an important distinction, and conversely, what some of the benefits are of those larger organizations and how the two scales of event play off each other to help foster the overall community.
Pete Soderling
0:30:02
Yeah, well, one thing here comes down to the mentality of the engineer, and the other side of it is the mentality of the sponsor and the partner. Hey, I think engineers are just noble. Like I said, engineers want transparency, they want to know how things work, they don't want to be oversold to, and they want to try products for themselves. There are all of these things baked into the engineering mindset, and first and foremost, we want to be known as the conference and the community that respects that. That's the main thing, because without engineers in our community loving it and getting to know each other, and if we're not careful about the opportunities and the contexts that we create for them, they're just going to run in the other direction. So first and foremost, those are the hallmarks of what we're building from the engineering side. Then on the partnership side, I think companies are not great at understanding how engineers think: recruiters are not great at talking to engineers, and marketers are not great at talking to engineers. Yes, engineers need jobs, and yes, engineers need new products and tools, but finding companies that actually know how to respect the mental hurdles that engineers have to get through in order to get interested in your product or your job, that's a super significant thing. Through my years of working in the space, I've done a lot of coaching and consulting with CTOs specifically surrounding these two things: recruiting and marketing to engineers. And I think that awesome companies who respect the central place that engineers have, and will continue to have, in the innovation economy that's coming realize that they have to change their tune in the way they approach these engineers. So our conference platform is a mechanism that we can use to gently steer and even teach some partners how to interact with engineers in a way that doesn't scare them away. Broadly speaking, like I mentioned, we're just super careful about how we integrate partners with our event, and as a team we're always trying to come up with better ways to message this and better ways to educate and welcome sponsors into the special network that we've built. But it's challenging; not all marketers think alike, and not all marketers know how to talk to engineers. We're committed to creating a new opportunity for them to engage with awesome data scientists and data engineers in a way that's valuable for both of them, and that's a really fun, big challenge. We're not as worried about how much it scales right now as we are about enhancing the quality of those interactions. That's what we're committed to as a brand, and it's not always easy, but we've learned a lot, and we have a lot to learn. We always touch base with the community after the events and ask them what they thought, how they interacted with the partners, did they find a new job, and how did that happen? So we're always trying to pour gasoline on what works and continue to innovate and move forward in that way.
Tobias Macey
0:33:03
In addition to your work on building these events and growing the meetups and overall networking opportunities for people in the engineering space, you have also become involved as an investor, and you've mentioned that you focus on data oriented businesses. I'm curious how you ended up getting involved in that overall endeavor, how that fits into your work with Data Council, and what your overall personal mission is.
Pete Soderling
0:33:30
Oh, yeah. Well, that's definitely one of my side projects. As I mentioned, I want to help 1000 engineers start companies, and this is just part of what makes me tick: helping software engineers through the conference, through advising their startups, through investing, and through helping them figure out go to market. I guess a lot of this energy for me came from having started four companies myself, as an engineer who didn't go to school but instead opted to start my first company in New York City in 2003. There weren't a lot of people that had started three companies in New York by that time. So yeah, I've learned a lot of things the hard way. I think a lot of engineers are kind of self taught, and they also tend to learn things the hard way. So a lot of my passion there is, again, meeting engineers where they're at and how they learn. And, you know, to them I'm kind of a business guy now. I have experience with starting companies, building products, fundraising, marketing, and building sales teams, and most of those things have not necessarily been top of mind for many software engineers that want to start a company. They have a great idea, they're totally capable of engineering it and building a product, but they need help with all the other business stuff, as well as fundraising. So I guess I've discovered the special place I have in the world where I'm able to help them and coach them through a lot of those business issues. I could never build the infrastructure that they're building or figure out the algorithms or the models that they're building, but I can help them figure out how to pitch it to investors, how to pitch it to customers, how to go to market, and how to hire a team that scales. Through my ups and downs as an entrepreneur, I've developed a large set of early stage hustle experience, and I'm super happy to pass that on to other engineers who are also passionate about starting companies. So that's just something I find myself doing anyway, as a mentor for 500 Startups or as a mentor for other companies. One thing led to another, and soon I started to do angel investing, and now I have an angel syndicate, which is actually quite interesting, because it's backed by a bunch of CTOs and heads of data science and engineers from our community, who all co-invest with me. And as I'm able to help companies bring their products to market, oftentimes there will be an investment opportunity there, and then there's another network of technical people who add even more value to that company. So I'm just sort of a connector in this community, and the community is doing all kinds of awesome stuff inside, and even sometimes outside, of Data Council, which is just a testament to the power of community overall. I'm super grateful that I'm along for the ride, that I've got engineers who come to me and trust me for help, and that I'm able to connect these dots and help them succeed as well.
Tobias Macey
0:36:30
In terms of the ways that businesses are structured, I'm wondering what it is about companies that are founded and led by engineers that makes them stand out, why you think it's important that engineers start their own companies, and how that compares to businesses that are founded and run by people who come more from the business side, along with your overall experience of the types of focus and levels of success that the two different founding patterns end up producing.
Pete Soderling
0:37:03
Yeah, well, you can tell based on what I've been saying that I'm just super bullish on the software engineer. Does that mean that the software engineer as a persona or a mentality or a skill set is inherently awesome and has no weaknesses and no problems? Hell no, of course not. I think some of the challenges of being a software engineer, and how that mentality fits into the rest of the business, are well documented. So all of us as engineers need to grow and diversify and increase the breadth of our skills; that has to happen. But on the other hand, if we believe that innovation is going to continue to sweep the world, and that highly technical solutions, sometimes to non technical problems and sometimes to technical problems, are going to continue to emerge, I feel like people who have an understanding of the technical implications and the realities and the architectures and the science of those solutions just have an interesting edge. So I think there's a lot of hope in teaching software engineers how to be great CEOs, and I think that's increasingly happening. I mean, look at Jeff Lawson from Twilio, or the guys from Stripe. Even Uber was started by an engineer, right? There was the quiet engineer at Uber, Garrett, quiet compared to Travis, who was a co-founder of that company. So I think we're seeing engineers not just start the most amazing companies that are changing the world, but increasingly end up in positions of becoming CEOs at those companies. You might even take that one step further: I'm kind of trying to be an engineer who's also been an operator and a founder, and now I'm stepping up to becoming a VC and being an investor. So I think there's the engineer VC, which is really interesting as well, but that's a slightly different conversation. Suffice it to say that engineers are bringing a valid mentality into all of these disciplines. And yes, of course, an engineer has to be taught to think like a CEO, has to learn how to be strategic, and has to learn sales and marketing skills, but I think it's an awesome challenge to be able to layer those skills on top of the engineering discipline that they already have. And I'm not saying this is the only way to start a company, or that business people can't find awesome engineers to partner with them. Honestly, I think an engineer often needs a business co-founder to help get things going. But I'm coming at it from the engineering side and then figuring out all the other support that the engineer needs to make a successful company, and that's just because I've chosen that particular way. Other people will be coming at it from the business side, and I'm sure that will be fine for them as well.
Tobias Macey
0:39:47
In terms of the challenges that are often faced by an engineering founder in growing a business and making it viable, what are some of the common points of conflict or misunderstandings or challenges that they encounter, and what are some of the ways that you typically work to help them in establishing the business and setting out on a path to success?
Pete Soderling
0:40:10
Well, I think the biggest thing that I see is that many engineers are just afraid to sell, and unfortunately you can't have a business if you don't have some sales. So somehow engineers have to get over that hurdle, and that can be a long process. It's been a long process for me, and I still undersell what we do at Data Council, to be honest, in some ways, and I have people around me to help me with that, and we want to do it in a way that's in line with the community. But I'm constantly trying to figure out how to be a better salesperson. For me, that means still sticking to the core of my engineering values, which is honesty, transparency, enthusiasm, and really understanding how to articulate the value of what you're bringing in a way that's unabashedly honest and transparent. So that's a really big thing for a pure engineer founder: it can be difficult to go out there and figure out how to get your first customers and how to start to sell yourself personally. And then the next step is how you build a sales culture and a process and a team that's in line with your values as a company, and that scares some engineers, because it's just terrifying to think about building a sales org when you can barely accept that your product needs to be sold at all. But that's sort of ground zero for starting a company. So I try to be as gentle as possible in guiding engineers through that process. I guess that's one of the core hiccups that engineers have to figure out how to get over, by bringing in other people they trust or getting advice, and you have to approach it sometimes with kid gloves. But teaching engineers how to sell in a way that's true to their values is a really big elephant in the room that I constantly run into and try to help engineer founders with.
Tobias Macey
0:42:08
In terms of the businesses that you evaluate for investing in, what is your strategy for finding the opportunities, and what are the characteristics of a business that you look for when deciding whether or not to invest? And as a corollary to all of that, why is it that you're focusing primarily on businesses in the data space, and what are the types of services that you're looking for that you think will be successful and useful?
Pete Soderling
0:42:37
Well, I guess, last question first: I think it's important to have focus as an investor, and not everybody can do everything well. I think it's also a strategy for building a new brand and a niche fund in the face of the Sequoias and the Kleiners of the world. We might be past the days when a new fund can easily build a brand that's that expansive. So this is just typical marketing strategy: if you focus on a niche and do that one niche really well, it produces network effects, smaller network effects inside that niche that then grow and grow and develop. I've chosen to focus my professional life on data: data science, data engineering, data analytics. That's Data Council. Partly that's because I just believe in the upside of that market, so it's a natural analogue to a lot of my investing activity, because I'm knowledgeable about the space, because I have a huge network in the space, and because I'm looking at and interested in talking to these companies all the time. It's just a natural fit. That's not to say that I don't do anything else; I'm also passionate about broader developer tools, and as you mentioned earlier, I'm an investor in Clubhouse, and I'm interested in security companies. So for me there are some analogues to deeply technical companies built by technical founders that also fit my thesis. But still, it's a fairly niche, narrow thesis. Most of the stuff I do on the investing side is B2B, I meet the companies through my network and through Data Council, and I think they're solving meaningful problems in the B2B space. Another criterion often is that they may be supported by some previous open source success, or the company might be built on some current open source project that I feel gives them an unfair advantage when it comes to distribution and getting a community excited about the product. So those are a few of the things that I look for in terms of the investing thesis.
Tobias Macey
0:44:39
In terms of the overall industry and some of the trends that you're observing in your interactions with engineers and in vetting the different conference presentations and meetup topics, what are some of the trends that you're most excited by, and what are some of the ones that you are concerned by, or potential issues that you see coming down the road in terms of the overall direction that we are heading, as far as challenges that might be lurking around the corner?
Pete Soderling
0:45:10
Well, I think one big thing there is just data science and ethics: AI fairness, bias in AI and in deep learning models, ethics when it comes to targeting customers and what data you keep on people. I think all these things are super interesting; they're policy issues and business issues at one level, and they're also technical issues, so there's technical implementation stuff that's required. But I just think raising that discussion is important, and so that's one area that we're focusing on at Data Council in the next series of events that we run later this year and next year: elevating that content in front of our community so that it can be a matter of discussion, because those are important topics even if they aren't always seen as the most technical. I think they're super important in terms of helping the community steer and figure out where the ship is going in the future. And then on the technical side, I think the data ops world is starting to mature. I think there are a lot of interesting opportunities there, in the same way that the DevOps revolution changed the way that software was built, tested, deployed, and monitored, and companies like Chef and New Relic came out in perhaps the mid 2000s. I think we're at a similar inflection point with data ops, if you will. There's more repeatable pipeline process, there are backtesting and auditing capabilities that are required for data pipelines that aren't always there, and there's monitoring infrastructure that's being built, along with some cool companies that I've seen recently. So data ops, and basically just elevating data engineering to the level that software engineering has been at for a while, is definitely something that seems to be catching fire, and we also try to support that through the conference as well.
Tobias Macey
0:47:06
Looking forward, what are some of the goals and plans that you have for the future of the Data Council business and the overall community and events?
Pete Soderling
0:47:17
Well, as I mentioned, our biggest goal is to build the data and AI talent community worldwide, and I think we're building a network business. So I guess it kind of takes me back to when I started Hakka Labs, which was the digital social network, and I thought we were building a digital product. As I already mentioned, one thing led to another, and now we have Data Council instead. Well, Data Council is butts in seats at events and community, and it's engineers getting together in real life. But it's still a network. It's not a superfast digital, formal network, but it's a network, and it's a super meaningful network. So it's kind of interesting that after all the ups and downs, I still see ourselves as being in the network building business. And the cool thing about building a network is that once you build it, there's lots of different value that you can add on top of it. So there are really big ideas here: there's formalizing education and education offerings, there are consulting services that can be built on top of this network where companies can help each other, and recruiting sort of fits in the same vein. There are other things as well; there's a fund, which I mentioned is a side project that I have to help engineers in this community start companies. So there are all kinds of interesting things that you can build on top of a network once you've gotten there. For now, our events are essentially a breakeven business that just gives us an excuse, and the ability, to grow this network all around the world. But I think there's a much bigger phase two or phase three of this company where we can build really awesome stuff based on this engineering network, a network and a brand that engineers trust, that we've laid down in the early part of the building phase. So I'm really excited to see that and to develop that strategy and mission going forward.
Tobias Macey
0:49:14
Are there any other aspects of your work on Data Council, or your investments, or your overall efforts in the data space that we didn't discuss yet that you'd like to cover before we close out the show?
Pete Soderling
0:49:26
No, I think we covered a lot of stuff. I hope it was interesting for you and your audience, and I encourage folks to reach out to me and get in touch. If there are engineers out there that want to start companies, or engineers that want to participate in our community worldwide, we're always looking for awesome people to help us find and screen talks, and we're interested in awesome speakers as well. I'm always interested in talking to deep learning and AI researchers who might have ideas that they want to bring to market. You can reach me at pete at datacouncil dot ai, and I'm happy to plug you into our community. And if I can be helpful to anyone out there, I would really encourage them to reach out.
Tobias Macey
0:50:09
And for anyone who does want to follow up with you, keep in touch, or follow along with the work that you're doing, I'll have your contact information in the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Pete Soderling
0:50:26
Yeah, as I mentioned, maybe it comes down to this data ops thing, right? There are really interesting open source projects coming out, like Great Expectations, and interesting companies coming out like Elementl, which is built around Dagster, another open source project. So I think this is a really interesting niche of tooling, specifically in the data engineering world, that we should all be watching. And then the other category of tooling I'm seeing is sort of related: it's in the monitoring space, watching the data in your data warehouse to see if anomalies pop up, because we're all pulling together data from so many hundreds of sources now that it's a little bit tricky to watch for data quality and integrity. So there's a new suite of tools popping up in that data monitoring space which are very interesting. Those are a couple of areas that I'm interested in and looking at, especially when it comes to data engineering applications.
Tobias Macey
0:51:26
Well, thank you very much for taking the time today to share your experiences building and growing this event series and contributing to the overall data community, as well as your efforts on the investment and business side. It's definitely an area that I find valuable, and I've been keeping an eye on your conferences; there have been a lot of great talks that come out of them. So I appreciate all of your work on that front, and I hope you enjoy the rest of your day.
Pete Soderling
0:51:50
Yeah, thanks, Tobias. We'll see you at a Data Council conference sometime soon.
Tobias Macey
0:51:59
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

A High Performance Platform For The Full Big Data Lifecycle - Episode 94

Summary

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
    • What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
  • Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
  • Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
  • For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
    • How does HPCC Systems manage persistence and scalability?
  • What are the integration points available for extending and enhancing the HPCC Systems platform?
  • What is involved in deploying and managing a production installation of HPCC Systems?
  • The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
  • How does the Thor engine manage data transformation and manipulation?
    • What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
  • For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
  • How are you using the HPCC Systems platform in your work at LexisNexis?
  • Despite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
  • How is the HPCC Systems project governed, and what is your approach to sustainability?
    • What are some of the additional capabilities that are only available in the enterprise distribution?
  • When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
  • What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
  • What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
  • What do you have planned for the future of HPCC Systems?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:14
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI conference, the Strata Data conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Your host is Tobias Macey, and today I'm interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions. So Flavio, can you start by introducing yourself?
Flavio Villanustre
0:01:36
Of course, Tobias. My name is Flavio Villanustre. I'm Vice President of Technology and CISO for LexisNexis Risk Solutions. At LexisNexis Risk Solutions we have a data platform called the HPCC Systems platform, which we made open source in 2011, and since then I have also, as part of my role, been involved with leading the open source community initiative: ensuring that the open source community truly leverages the platform and helps contribute to it, and certainly creating a liaison between the LexisNexis Risk Solutions organization and the rest of the larger open source community.
Tobias Macey
0:02:19
And do you remember how you first got involved in the area of data management?
Flavio Villanustre
0:02:22
It has been gradual. It probably started in the early 90s, going from databases to database analytics to data management to data integration. Keep in mind that even within LexisNexis, we started the HPCC Systems platform back before the year 2000, and back then we already had data management challenges with traditional platforms, and we started with this. I've been involved with HPCC since I joined the company in 2002, but I've been in data management for a very long time.
Tobias Macey
0:02:57
And so for the HPCC system itself, can you talk through some of the problems that it was designed to solve and some of the issues that you were facing at LexisNexis that led to its original creation?
Flavio Villanustre
0:03:10
Oh, absolutely. So at LexisNexis Risk Solutions, risk management has been, I'd say, our core competency since back in the mid 90s. And as you get into risk management, one of the core assets when you are trying to assess risk and predict outcomes is data. Even before people spoke about big data, we had a significant amount of data, mostly structured, some semi-structured too, but the vast majority structured. And we used to use the traditional platforms out there, whatever we could get our hands on; again, this is back in the day before Hadoop, and before MapReduce was applied as a distributed paradigm for data management or anything like that. So databases: Sybase, Oracle, Microsoft SQL Server, whatever there was; data management platforms like Ab Initio or Informatica, whatever was available at the time. And the biggest problem we had was twofold. One was scalability: all of those solutions typically run in a single system, so there is a limit to how much bigger you can go vertically, and certainly that limit is much lower if you also consider the cost and affordability of the system; there is a point where you go beyond what a commodity system is, and you start paying a premium price for whatever it is. So that was the first piece. One of the attempts at solving this problem was to split the data and use different systems, but splitting the data also creates challenges around data integration. If you're trying to link data, surely you can take the traditional approach, which is to segment your data into tables, put those tables in different databases, and then use some sort of foreign key to join the data. But that's all well and good as long as you have a foreign key that is unique and reliable, and that's not the case with data that you acquire from the outside. If you generate the data yourself, you can have that; if you bring the data in from the outside, you might have a record that says this record is about John Smith, and you might have another record that says this other record is about Mr. John Smith, but do you know for sure that those two records are about the same John Smith? That's a linkage problem, and the only way you can do linkage effectively is to put all the data together. So now we have this particular issue where, in order to scale, we need to segment the data, but in order to do what we need to do, we need to put the data in the same data lake, as the term is now. We used to call it a data land; eventually we adopted the term data lake in the late 2000s because it became more well known. So at that point, the potential paths to overcome the challenge were: well, we either split all of the data as before, and then come up with some sort of meta system that leverages all of these discrete data stores. But when you're doing probabilistic linkage, you have problems whose computational complexity is n squared or worse, which means that you would pay a significant price in performance; potentially it can be done if you have enough time, your systems are big enough, and you have enough bandwidth between the systems, but the complexity you're taking on from a programming standpoint is also quite significant. And
0:06:33
sometimes you don't
0:06:34
have enough time. Sometimes you get data updates that are hourly or daily, and doing this big linkage process might take you weeks or months if you're doing it across different systems. And the complexity of programming this is also a pretty significant factor to consider. So at that point, we thought that a better approach was to create an underlying platform that applies this type of solution to the problem, with algorithms in a divide and conquer type of approach: we would have something that would partition the data automatically and distribute those partitions onto different commodity computers, and then we would add an abstraction layer on top of it that would create a programming interface giving you the appearance of dealing with a single system and a single data store, and whatever you coded for that data store would be automatically distributed to the underlying partitions. Also, because the hardware was quite a bit slower than it is today, we thought that a good idea would be to move as much of the algorithm as we could to those partitions rather than executing it centrally. So instead of bringing all of the data to a single place to process it, a place which might not have enough capacity, we would do as much as we could, whether a filtering operation, a distributed grouping operation, or a partial aggregation, across each one of the partitions; and eventually, once you need to do the global aggregation, you can do it centrally, but now with a far smaller dataset that is already pre-filtered. And then the time came to define how to build the abstraction layer. The one thing that we knew about was SQL as a programming language, and we said, well, this must be something that we can tackle with SQL as the programming interface for our data analysts. But the people who worked with us were quite used to a dataflow model because of the type of tools they were using before, things like Ab Initio, where the data flows are these diagrams in which the nodes are the operations, the activities you do, and the lines connecting the activities represent the data traversing them. So we thought that a better approach than SQL would be to create a language that gave you the ability to build this sort of data flow in the system. That's how ECL was born, which is the language that runs on HPCC.
Tobias Macey
0:09:05
So it's interesting that you had all of these very forward looking ideas in terms of how to approach data management, well in advance of when the overall industry started to encounter the same types of problems as far as the size and scope of the data that they were dealing with, problems that led to the rise of the Hadoop ecosystem and the overall ideas around data lakes and MapReduce and some of the new data management paradigms that have come up. And I'm wondering what the overall landscape looked like in the early days of building the HPCC system that required you to implement this in house, and some of the other systems or ideas that you drew on for inspiration for some of these approaches to data management and the overall systems architecture for HPCC?
Flavio Villanustre
0:09:52
That is a great question. It's interesting, because in the early days when we told people what we were doing, they would look at us baffled and ask, well, why don't you use database XYZ or data management system XYZ? And the reality is that none of those would be able to cope with the type of data processing we needed. They wouldn't offer the flexibility of process, like this probabilistic record linkage that I explained before, and they certainly didn't offer a seamless transition between data management and data delivery, which was also one of the important requirements that we had at the time. It was quite difficult to explain to others why we were doing this and what we were gaining by doing it. Map and reduce operations, as functional programming operations, have been around for a very long time, since the Lisp days in the 50s, but the idea of using map and reduce as operations for data management didn't get published until, I believe, December 2004. I remember reading the original paper from the Google researchers and thinking, well, now someone else has the same problem, and they got to do something about it. But at the time we already had HPCC and we already had ECL, so it was perhaps too late to go back and re-implement the data management aspects and the programming layer abstraction on HPCC. Just for those people in the audience who don't know much about ECL — and again, this is all open source and free, Apache 2.0 licensed, there are no strings attached, so please go and look at it — in summary, ECL is a declarative dataflow programming language. Declarative in the manner of what you can find in SQL, or in functional programming languages like Haskell, or maybe the way of programming in Lisp, Clojure, and others out there; but the dataflow part is closer to something like TensorFlow, if you are familiar with TensorFlow as a deep learning programming paradigm and framework. You code with operations that are data primitives, like for example sort: you can say sort this dataset by this column in this order, and then you can add more modifiers if you want; you can do a join across datasets, and again the operation is called join; and you can do a rollup operation, with an operation named rollup. All of these are high level operations that you use to define your program. And in a declarative programming language, you create definitions rather than assigning variables. For those who are not familiar with declarative programming — and surely many in this audience are — declarative programming has, for the most part, the property of having immutable data structures, which doesn't mean that you cannot do valuable work; you can do all of the work the same way or better, but it gets rid of side effects and other pretty bad issues that come with more traditional mutable data structures. So you define things: I have a dataset that is a phone book, and I want to define an attribute that is this dataset filtered by a particular value, and then I might define another attribute that uses the filtered dataset to now group it in a particular way. At the end of the day, any single program is just a set of definitions that are compiled by the compiler.
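(For illustration, here is a minimal sketch of the kind of ECL definitions described above. The record layout, logical file name, and field names are hypothetical; the primitives used — DATASET, SORT, TABLE, OUTPUT — are standard ECL.)

```ecl
// Hypothetical record layout for a phone book dataset
PhoneRec := RECORD
  STRING25 firstName;
  STRING25 lastName;
  STRING15 phone;
  STRING2  state;
END;

// Logical file previously sprayed onto the Thor cluster (name is illustrative)
phoneBook := DATASET('~tutorial::phonebook', PhoneRec, THOR);

// Definitions, not variable assignments: each attribute names an expression
nyRecords     := phoneBook(state = 'NY');               // filter
byLastName    := SORT(nyRecords, lastName, firstName);  // sort primitive
countsByState := TABLE(phoneBook,                        // grouped summary
                       {state, UNSIGNED c := COUNT(GROUP)},
                       state);

OUTPUT(byLastName, NAMED('ny_phonebook'));
OUTPUT(countsByState, NAMED('counts_by_state'));
```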
That compiler is the ECL compiler, which generates C++, which then goes into the C++ compiler of the system, whatever compiler you have there, and produces the machine code that runs on the platform. The fact that ECL is such a high level programming language, and the fact that it is declarative, means that the ECL compiler can make decisions that a more imperative programming language wouldn't allow a compiler to make. The compiler in a declarative language, as in functional languages, knows the ultimate goal of the program, because the program is in some ways isomorphic to an equation; from a functional standpoint you could even inline every one of your statements into a single massive statement, which of course you couldn't do from an imperative standpoint. So the compiler can, for example, apply non-strictness: if you made a definition that is never going to be used, there is no point for that definition to even be compiled in or executed at all, and that saves work. If you have a conditional fork somewhere in your code, but that condition is always met or never met, then there is no need to compile the other branch. All of this has performance implications that can be far more significant when you're dealing with big data. One particular optimization is around combining operations: it is far more efficient, and a lot faster, if you are going to apply similar operations to every record in a dataset, to combine all of those operations and do only one pass over the data with all of the operations, if that's possible. And the ECL compiler does exactly that. It takes away perhaps a little bit of flexibility from the programmer by being far more intelligent at the moment the code is compiled; of course, programmers can tell the compiler "I know better" and force it to do something that might otherwise seem unreasonable. But just as an example: you could say, well, I want to sort this dataset, and then I want to filter it and keep only these few records. If you say it in that order, an imperative programming language would first sort — and sorting, even in the most optimal case, is an n log n operation in computational complexity — and then filter and keep only a few records, when the optimal approach would be to filter first, get those few records, and then sort just those records. And the ECL compiler does exactly that.
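(A small sketch of the reordering just described: written as "sort, then filter", but because these are side-effect-free declarative definitions, the code generator is free to evaluate it as "filter first, then sort the few survivors". The dataset and field names continue the hypothetical phone book example above.)

```ecl
PhoneRec := RECORD
  STRING25 firstName;
  STRING25 lastName;
  STRING15 phone;
END;
phoneBook := DATASET('~tutorial::phonebook', PhoneRec, THOR);

// Expressed as: sort the whole phone book, then keep only the Smiths...
sortedAll  := SORT(phoneBook, lastName, firstName);
smithsOnly := sortedAll(lastName = 'SMITH');

// ...but the compiler may apply the filter before the sort,
// sorting only the surviving records instead of the full dataset.
OUTPUT(smithsOnly);
```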
Tobias Macey
0:16:01
The fact that the language that you're using for defining these manipulations ends up being compiled — and I know that it's implemented in C and C++, both the ECL language itself as well as the overall HPCC platform — is definitely a great opportunity for better performance characteristics, and I know that in the comparisons that you have available between HPCC and Hadoop that's one of the things that gets called out. As far as the overall workflow for somebody who is interacting with the system using the ECL language, I'm curious if the compilation step ends up being, not a hindrance, but a delaying factor as far as being able to do some experimental iteration, or if there is the capability of doing some level of interactive analysis or interaction with the data, for being able to determine what is the appropriate set of statements to get the desired end result when you're building an ECL data flow?
Flavio Villanustre
0:17:05
Nice, another great question. I can see that you are quite versed
0:17:10
in programming. So you're right, the fact that ECL is compiled means that — just again, for the rest of the audience, we have an integrated development environment, the ECL IDE, and of course we support others like Eclipse and Visual Studio and all of the standard ones, but I'll just talk about the ECL IDE because it's what I mostly use — when you write ECL code, you can certainly syntax check it, which verifies that the code is syntactically correct, but at some point you want to run the code because you want to know whether it makes sense semantically and gives you the right results. And running the code means going through the compilation process; depending on how large your code base is, the compilation can take longer. Now, the compiler does know what has been modified. Remember, again, ECL is a declarative programming language, so if you haven't touched a number of attributes — and since data structures are immutable and there are no side effects, attributes you didn't change should behave exactly the same; when you define a function, that function has referential transparency, which means that calling it at any time gives you the same result based only on the parameters you pass — the compiler can take some shortcuts. If you are recompiling a bunch of ECL attributes but you haven't changed many of them, it will just use the precompiled code for those and only compile the ones you have changed. So the compilation process, when you are iteratively working on code, tends to be fairly quick, maybe a few seconds; of course, you depend on having an ECL compiler available. Traditionally we had a centralized approach to the ECL compiler, where there would be one or a few of them running in the system. We have moved to a more distributed model where, when you deploy the ECL IDE and the ECL tools on your workstation, a compiler goes with them, so compilation can happen on the workstation as well, and that gives you the ability to have it available at all times. One of the bottlenecks before was when you were trying to do this quick iterative programming approach and the compiler was busy with someone compiling a massive amount of ECL for a completely new job, which may have taken minutes, and you were sitting there waiting for the compiler to finish that compilation. By the way, the time to compile is an extremely important consideration, and we continue to improve the compiler to make it faster; we have learned a lot over the years, as you can imagine. Some of the same core developers who created the ECL compiler — Gavin Halliday, for example — have been with us since the very beginning. He was one of the core architects behind the initial design of the platform, and he's still the lead architect developing the ECL compiler, which means a lot of the knowledge that has gone into the compilation process and its optimization keeps making it better and better. Of course, now with a larger community working on the compiler, and more people involved and more documentation around it, others can pick up where he leaves off.
But hopefully he will be around doing this for a long time, making sure that the compiler is as just-in-time as it can be. There is, at this point, no interpreter for ECL, and I think it would be quite difficult to make it completely interactive, to the point where you submit just a line of code and it does something, because of the way a declarative programming paradigm works, right?
Tobias Macey
0:21:17
And also, because you're most likely working with large volumes of data distributed across multiple nodes, being able to do REPL driven development is not really very practical, or it doesn't make a lot of sense. But the fact that there is this fast compilation step, and the ability to have near real time interactivity as far as seeing what the output of your program is, is good to see, particularly in the big data landscape, where I know the overall MapReduce paradigm was plagued in the early years by the fact that it was such a time consuming process to submit a job and then see what the output was before you could take that feedback and wrap it into your next attempt. And that's why there have been so many different layers added on top of the Hadoop platform, in the form of Pig and Hive and various SQL interfaces, to be able to get a more real time, interactive, and iterative development cycle built in.
Flavio Villanustre
0:22:14
Yeah, you're absolutely right there. Now, one thing that I haven't told the audience yet is what the platform looks like, and I think we are getting to the point where it's quite important to explain that. There are two main components in the HPCC Systems platform. There is one component that does data integration, this massive data management engine, the equivalent of your data lake management system, which is called Thor. Thor is meant to run one ECL work unit at a time, which can consist of a large number of operations, many of them running in parallel of course. And there is another one, known as Roxie, which is the data delivery engine, and there is one which is sort of a hybrid, called hThor. Roxie and hThor are both designed to handle tens of thousands or more operations at the same time, simultaneously; Thor is meant to do one work unit at a time. So when you are developing on Thor, even though your compilation might be quick, and you might run on a small dataset quickly — because you can execute the work unit on those small datasets using, for example, hThor — if you are trying to do a large data transformation of a large dataset on your Thor system, you still need to go into the queue on that Thor, and you will get your turn whenever it's due, right? Surely, we have priorities, so you can jump into a higher priority queue, and maybe you can be queued after just the current job but before any other future jobs. We also partition jobs into smaller units, and those smaller units are fairly independent from each other, so we can even interleave some of your jobs in between a job that is running by getting in between those segments of the work unit. Admittedly this is a little bit less than optimal, but it is the nature of the beast, because you want a large system to be able to process all of the data in a relatively fast manner, and if we tried to truly multi-process, most likely many of the resources would suffer, so you might end up paying a significant overhead across all of the processes running in parallel. Now, I did say that Thor runs only one work unit at a time, but that was a little bit of a lie; that was true a few years ago. Today you can define multiple queues in a Thor, and you can run three or four work units, but certainly not thousands of them. So that's a big difference from Roxie. Can you run your work unit on Roxie or hThor? Yes, and it will run concurrently with anything else that is running, with almost no limit; thousands and thousands of them can run at the same time. But there are other considerations about when you run things on Roxie or hThor versus on Thor, so it might not be what you really want.
Tobias Macey
0:25:29
Taking that a bit further, can you talk through a standard workflow for somebody who has a data problem that they're trying to solve, and the overall lifecycle of the information as it starts from the source system, gets loaded into the storage layer of the HPCC platform, they define an ECL job that then gets executed on Thor or hThor, and then being able to query it out the other end from Roxie, and just the overall systems that the data interacts with throughout that lifecycle?
Flavio Villanustre
0:26:01
I'd love to. Very well, let's set up something very simple as an example. You have a number of datasets coming from the outside, and you need to load those datasets into HPCC. The first operation that happens is something known as spray; the name comes from the concept of spray painting the data across the cluster, right? This runs on a Windows box or a Linux box, and it will take the dataset — let's say your dataset is a million records long. It can be in any format: CSV, fixed length, delimited, whatever. It will look at your total dataset, and it will look at the size of the Thor cluster where the data will be stored initially for processing. Let's say that you have a million records in your dataset and ten nodes in your Thor, just to keep the numbers round and small. So it will partition the dataset into ten partitions, because you have ten nodes, and it will then copy each one of those partitions to the corresponding Thor node. If this can be parallelized in some way — because, for example, your records are fixed length — it will automatically use pointers and parallelize it; if the data is in XML format or in a delimited format where it's harder to find the partition points, it will need to do a pass over the data, find the partition points, and eventually do the parallel copy to the Thor system. So now you end up with ten partitions of the data, with the data in no particular order other than the order you had before, right? The first hundred thousand records go to the first node, the second hundred thousand records go to the second node, and so on and so forth until you get to the end of the dataset. This puts a similar number of records on each one of the nodes, which tends to be a good thing for most processes. Once the data is sprayed, or
0:28:10
while the data is being sprayed, and depending on the size of the data,
0:28:13
or even before, you will most likely need to write a work unit to work on the data. I'm trying to do this example as if it were the first time you see this data; otherwise all of this is automated, right, so you don't need to do anything manually — all of this is scheduled and automated, and a work unit that you already had would run on the new dataset and append it or do whatever needs to be done. But let's imagine that this is completely new. So now you write your work unit. Let's say that your dataset is a phone book, and you want to, first of all, dedupe it and build some rolled up views on the phone book, and eventually you want to allow users to run queries through a web interface to look up people in the phone book. And just for the sake of argument, let's say that you're also trying to join that phone book with your customer contact information. So you will write a work unit that has the join to merge those two, some deduplication, and perhaps some sorting. And after you have that, you will want to — you don't need to, but you will want to — build some keys. There is, again, a key build process that also runs on Thor and will be part of your work unit. So essentially, the ECL developer writes the ECL and submits the work unit, the ECL is compiled and runs on your data; hopefully the ECL is syntactically correct when you submit it, and it runs giving you the results that you were expecting on the data. I mentioned this before, but ECL is a statically typed language as well, which means that it is a little bit harder to get errors that only appear at runtime; between the fact that it has no side effects and that it is statically typed, most type errors and malformed function invocation errors are a lot less frequent. It is not like Python, where things may
0:30:17
seem okay, the
0:30:20
run may be fine, but then at some point one run will give you a weird error because a variable that was supposed to hold a piece of text has a number, or the reverse. So you run the work unit, and the result of this work unit will potentially give you some statistics and some metrics on the data, and it will give you a number of keys. Those keys will also be partitioned on Thor, so the keys will be partitioned in pieces across those nodes, and you will be able to query those keys from Thor as well; you can write a few attributes that do the querying there. But at some point you will want to write those queries for Roxie to use, and you will want to put the data on Roxie, because you don't have one user querying the data, you may have a million users going to query that data, and perhaps ten thousand of them querying at the same time. So for that, you write another piece of ECL, another sort of work unit, but we call this a query, and you submit it to Roxie instead of Thor; there is a slightly different way to submit it — you select Roxie and submit it there. The difference between this query and the work unit that you have on Thor is that the query is parameterized: similar to a parameterized stored procedure in your database, you define some variables that are supposed to come from the front end, from the input from the user, and then you use the values in those variables to run whatever filters or aggregations you need to do there, which will run on Roxie and will leverage the keys that you built on Thor. As I said before, the keys are not mandatory; Roxie can work perfectly well without keys, and it even has a way to work with in-memory distributed datasets, so even if you don't have a key you don't pay a significant price in lookups by doing sequential scans of the data and full table scans of your database. When you submit that query to Roxie, Roxie will realize that the data it needs is not in Roxie but in Thor. This is also your choice, but most likely you will just tell Roxie to load the data from Thor; it knows where to load the data from, because it knows what the keys are and what the names of those keys are, and it will automatically load them. It is also your choice whether Roxie starts allowing users to query the front end interface while it's loading the data, or waits until the data is loaded before it allows queries to happen. The moment you submit the query to Roxie, it is automatically exposed on the front end. There is a component called ESP; that component exposes a web services interface, and this gives you a RESTful interface, a SOAP interface, JSON for the payload if you're going through the RESTful interface, even an ODBC interface if you want, so you can even have SQL on the front end. So the moment you submit the query, it automatically generates all of these web service interfaces.
So automatically, if you want to go with a web browser on the front end, or if you have an application that can use, I don't know, a RESTful interface over HTTP or HTTPS, you can use that, and it will automatically have access to that Roxie query that you submitted. Of course, a single Roxie might have not one query but a thousand different queries at the same time, all of them exposing an interface, and it can have several versions of the same queries as well. The queries are all exposed versioned from the front end, so you know which one the users are accessing, and if you are deploying a new version of a query, or you've modified or extended it, you don't break your users if you don't want to; you give them the ability to migrate to the new version as they want. And that's it, that's pretty much the process. Now, as I have described this, you need to have automation, and all of this can be fully automated in ECL. You may want to have data updates, and I told you data is immutable, so every time you think you're mutating data, you're updating data, you're really creating a new dataset, which is good because it gives you full provenance; you can go back to an earlier version. Of course, at some point you need to delete data or you will run out of space, and that can also be automated. And if you have updates on your
0:34:36
data, we have concepts like superfiles, where you can apply updates, which are essentially new overlays on the existing data, and the existing work unit can just work on that, happily, as if it were a single dataset. So a lot of these complexities that would otherwise be exposed to the developer are all abstracted away by the system. If developers don't want to see the underlying complexity, they don't need to; if they do, they have the ability to. I mentioned before that ECL will optimize things, so if you tell it, do this join, but before doing the join do this sort, well, it may know that the sort isn't needed. And if you know that your data is already sorted, you might say, well, let's not do this sort, or I want to do this join on each one of the partitions locally instead of a global join, and the same thing with the sort — there is a local sort operation in ECL. Of course, if you tell it to do that and you know better than the system, ECL will follow your orders; if not, it will take the safe approach to your operation, even if it's a little bit more overhead.
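(As a rough sketch of the workflow just described: the first fragment below is a Thor-style work unit that joins a sprayed phone book with customer contacts, dedupes the result, and builds an index; the second is a parameterized Roxie query that reads that index using STORED values supplied through the ESP web service interface. All file names, record layouts, and field names are made up for illustration, and in practice the two fragments would live in separate ECL files targeted at Thor and Roxie respectively.)

```ecl
// ---------- Thor work unit: integrate, dedup, and build a key ----------
PhoneRec := RECORD
  STRING25 firstName;
  STRING25 lastName;
  STRING15 phone;
END;

ContactRec := RECORD
  STRING15 phone;
  STRING40 email;
END;

phoneBook := DATASET('~example::phonebook::raw', PhoneRec, THOR);       // sprayed file
contacts  := DATASET('~example::customers::contacts', ContactRec, THOR);

CombinedRec := RECORD
  PhoneRec;
  STRING40 email;
END;

CombinedRec joinThem(PhoneRec l, ContactRec r) := TRANSFORM
  SELF.email := r.email;   // contact info (blank when unmatched)
  SELF       := l;         // remaining fields come from the phone book
END;

// Merge the two sources on phone number, then drop exact duplicates
merged  := JOIN(phoneBook, contacts, LEFT.phone = RIGHT.phone,
                joinThem(LEFT, RIGHT), LEFT OUTER);
deduped := DEDUP(SORT(merged, lastName, firstName, phone),
                 lastName, firstName, phone);

// Persist the integrated data and build a payload index keyed by name
OUTPUT(deduped, , '~example::phonebook::integrated', OVERWRITE);
idx := INDEX(deduped, {lastName, firstName}, {phone, email},
             '~example::phonebook::byname');
BUILD(idx, OVERWRITE);
```

```ecl
// ---------- Roxie query: parameterized lookup published behind ESP ----------
STRING25 lnameIn := '' : STORED('lastName');   // filled in from the web service request
STRING25 fnameIn := '' : STORED('firstName');

byName := INDEX({STRING25 lastName, STRING25 firstName},
                {STRING15 phone, STRING40 email},
                '~example::phonebook::byname');

// Keyed filter on the leading field; first name is optional
matches := byName(KEYED(lastName = lnameIn),
                  fnameIn = '' OR firstName = fnameIn);

OUTPUT(CHOOSEN(matches, 100), NAMED('Results'));
```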
Tobias Macey
0:35:47
A couple of things that I'm curious about out of this are the storage layer of the HPCC platform and some of the ways that you manage redundancy and durability of the data. I also noticed when looking through the documentation that there is some support for being able to take backups of the information, which I know is non-trivial when dealing with large volumes. And on the Roxie side, I know that it maintains an index of the data, and I'm curious how that index is represented, maintained, and kept up to date in the overall lifecycle of the platform.
Flavio Villanustre
0:36:24
Those are also very good questions. So in the case of Thor — we need to go down to a little bit of system architecture here — each one of the nodes primarily handles its own chunk of data, its partition of the data, but there is always a buddy node, some other node that has its own partition but also holds a copy of the partition of another node. If you have ten nodes in your cluster, node number one might have the first partition and also a copy of the partition that node ten has; node number two might have partition number two, but also a copy of the partition that node number one has, and so on and so forth. Every node has one primary partition and one backup partition from another node. Every time you run a work unit — I said that data is immutable, so you are generating a new dataset every time you materialize data on the system, either by forcing it to materialize or by letting the system materialize the data when it's necessary; the system tries to stream as much as it can, in a way more similar to Spark or TensorFlow, where the data can be streamed from activity to activity without being materialized — at some point it decides it's time to materialize, because the next operation might require materialized data, or because you've been going for too long with data that, if something went wrong with the system, would be blown away. Every time it materializes data, a lazy copy happens, where the newly materialized data is copied to those backup nodes. So there could be a point where, if something goes very wrong and one of the nodes dies and the data on the disk is corrupted, you know that you always have another node that has a copy, and the moment you replace the node that went kaput — essentially pull it out and put another one in — the system will automatically rebuild that missing partition, because it has complete redundancy of all of the data partitions across the different nodes. In the case of Thor that tends to be sufficient. There is, of course, the ability to do backups, and you can back up all of these partitions, which are just files in the Linux file system, so you can back them up using any Linux backup utility, or you can have HPCC back them up for you into any other system; you can have cold storage. One of the problems is what happens if your data center is compromised and someone modifies or destroys the data in the live system, so you may want some sort of offline backup, and you can handle this in your normal system backup configuration, or you can have HPCC do it and make it offloaded as well. But for Roxie, the redundancy is even more critical. In the case of Thor, when a node dies, it is sometimes less convenient to let the system work in a degraded way, because the system is typically as fast as the slowest node. If all nodes are doing the same amount of work, a process that takes an hour will take an hour; but if you have one node die, now there is one node doing twice the work, because it has to deal with two partitions of data, its own and the backup of the other one, and the process may take two hours. So it is more convenient to just stop the process when something like that happens, replace the node, let the system rebuild that node quickly, and continue the processing.
That might take an hour and twenty minutes, or an hour and ten, rather than the two hours it would otherwise take. Besides, if the system continues to run and the storage system died in one node because it's old, there is a chance that other storage systems, when they get put under the same stress, will die the same way, so you want to replace that one quickly and have a copy as soon as you can, and not run the risk that you lose two of the partitions. If you lose two partitions that are on different nodes and are not backups of each other, that's fine; but if you lose the primary node and the backup node for the same partition, there is a chance that you end up losing the entire partition, which is bad — again, bad if you don't have a backup, and even then you're restoring a backup of the data from some time ago, so it's also inconvenient. Now, in the Roxie case, you have far more pressure to have the process continue, because your Roxie system is typically exposed to online production customers that may pay you a lot of money for you to be highly available.
0:41:06
So Roxie allows you to define the amount of redundancy that you want, based on the number of copies that you want. You could say, well, in this Roxie I need, which is the default, one copy of the data, or I need three copies of the data; so the partition that node one has will also have copies on nodes two, three, and four, and so on and so forth. Of course you need four times the space, but you have far higher resilience if something goes very wrong, and Roxie will continue to work even if a node is down, or two nodes are down, or as many nodes as you like are down, as long as the data is still there. Worst case scenario, if you lose a partition completely, Roxie might, if you want, continue to run, but it won't be able to answer any queries that were trying to leverage that particular partition that is gone, which is sometimes not a good situation. You asked about the format of the keys, and the format of the indexes in Roxie is interesting. Those keys — and this is, again, typically the format of the data that you have in Roxie — for the most part you will have a primary key; these are keys that are multi-field, like in any normal, decent database out there, so they have multiple fields, and typically those fields are ordered by cardinality: the fields with the larger cardinality go at the front to make it perform better. It has interesting abilities; for example, you can step over a field that you don't have a value for — you use a wildcard for it — and still use the remaining fields, which is not something that a normal database does: once you hit a field that you don't have a value to apply, the rest of the fields on the right hand side become useless. And there are other things that are quite interesting there. The way the data is stored in those keys is by decomposing the keys into two components: there is a top level component that indicates which node has that partition, and there is a bottom level component, which indicates where on the hard drive of the node the specific data elements, or the specific block of data elements, are. By decomposing the keys into these two hierarchical levels, every node in Roxie can hold the top level, which is very small, so every node knows where to go for the specific values. So every node can be queried from the front end, and you now have good scalability on the front end — you can have a load balancer and load balance across all of the nodes — and on the back end they can go and know which node to ask for the data. When I said that the top level points to the node with the specific partition, I lied a little bit, because it's not node number one; it uses multicast. So when nodes have a partition of the data, they subscribe to a multicast channel, and what you have in the top level is the number of the multicast channel that handles that partition. That allows us to make Roxie nodes more dynamic and also handle the fault tolerance situations where nodes go down: it doesn't matter, you send the message to a multicast channel, and any node that is alive will get the message. Which one responds? Well, it will be the faster node, the one that is less burdened by other queries, for example. And if any node in the channel dies, it really doesn't matter.
You're not stuck in a TCP connection waiting for a handshake to happen, because the way it's done is UDP: you send the message, and you will get the response. And of course, if nobody responds in a reasonable amount of time, you can resend that message as well.
Tobias Macey
0:44:53
Going back to the architecture of the system, and the fact of how long it's been in development and use, and the massive changes that have occurred at the industry level as far as how we approach data management, the uses of data, and the overall systems that we might need to integrate with, I'm curious how the HPCC platform itself has evolved accordingly, and some of the integration points that are available for being able to reach out to or consume from some of these other systems that organizations might be using.
Flavio Villanustre
0:45:26
We have changed quite a bit. Even though the HPCC Systems name and some of the code base resemble what we had 20 years ago, as you can imagine any piece of software is a living entity that changes and evolves, as long as the community behind it is active, right? So we have changed significantly: we have not just added core functionality to HPCC, or changed functionality to adapt to the times, but also built integration points. I mentioned Spark, for example. Even though HPCC is very similar to Spark, Spark has a large community around machine learning, so it is useful to integrate with Spark, because many times people may be using Spark ML but want to use HPCC for data management, and having a proper integration where you can run Spark ML on top of HPCC is something that can be attractive to a significant part of the HPCC open source community. In other cases, like Hadoop and HDFS access, it's the same, and likewise with integrations with other programming languages. Many times people don't feel comfortable programming everything in ECL; ECL works very well for data management, for anything that is a data management centric process, but sometimes you have little components in the process that cannot be easily expressed in ECL, at least not in a way that is efficient.
0:46:55
I don't know, I'll just throw out one little example: generating unique IDs for things in a random manner, like UUIDs.
0:47:06
Surely you can code this in ECL; you can come up with some crafty way of doing it in ECL, but it would make absolutely no sense to write it in ECL, to then be compiled into some big chunk of C++, when you can do it directly in C or C++ or Python or Java or JavaScript. So being able to embed all of these languages into ECL became quite important, and we built quite a bit of integration for embedded languages. A few major versions ago, a few years back, we added support for — I mentioned some of these languages already — Python, Java, and JavaScript, and of course C and C++ were already available before. So people can add these little snippets of functionality, create attributes that are just embedded-language attributes, and those are exposed in ECL as if they were ECL primitives. So now they have the ability to extend the core language to support new things without needing to write them natively in ECL every time. There are plenty of other enhancements as well on the front end side. I mentioned ESP; ESP is this front end access layer — think of it as a sort of message bus in front of your Roxie system. In the past, we used to require that you code your ECL query for Roxie and then also code an ESP module in C++; you needed to go to ESP and extend it with a dynamic module to support the front end interface for that query, which is twice the work, and you required someone who also knows C++, not just someone who knows ECL. So we changed that, and we now use something called dynamic ESDL, which auto-generates, as I mentioned before, these interfaces from ESP. As you code the ECL and define the parameterized interface to a query, ESP will automatically take those parameters and expose them in the front end interface for users to consume. We have also done quite a bit of integration with systems that help with benchmarking of HPCC, availability monitoring, performance monitoring, and capacity planning as well. We try to integrate as much as we can with other components in the open source community; we truly love open source projects, so if there is a project that has already done something we can leverage, we try to stay away from reinventing the wheel and we use it. If it's not open source, if it's commercial, we do have a number of integrations with commercial systems as well; we are not religious about it, but certainly it's a little bit less enticing to put the effort into something that is closed source. And again, we believe that the open source model is a very good one, because it gives you the ability to know how things are done under the hood, and to extend them and fix them if you need to. We do this all the time with other projects, and we believe it has a significant amount of value for anyone out there.
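(The UUID example mentioned above would look roughly like this using ECL's embedded-language support: a Python snippet exposed to ECL as if it were a native function. This is a minimal sketch; it assumes the Python embed plugin is installed on the cluster, and the attribute name is made up.)

```ecl
IMPORT Python;

// A Python snippet wrapped as an ordinary ECL function returning a STRING.
STRING36 makeUUID() := EMBED(Python)
  import uuid
  return str(uuid.uuid4())
ENDEMBED;

// Used like any other ECL primitive.
OUTPUT(makeUUID(), NAMED('example_uuid'));
```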
Tobias Macey
0:50:26
On the subject of the open source nature of the project, I know that it was released as open source in, I think you said, the 2011 timeframe, which postdates when Hadoop had become popular and started to accrue its own ecosystem. I'm curious what your thoughts are on the relative strength of the communities for Hadoop and HPCC currently, given that there seems to be a bit of a decline in Hadoop itself as far as the amount of utility that organizations are getting from it. But I'm also interested in the governance strategy that you have for the HPCC platform and some of the ways that you approach sustainability of the project.
Flavio Villanustre
0:51:08
So you're absolutely right: the Hadoop community has apparently, at least in number of people, reached a plateau, while the HPCC Systems community keeps growing. Of course, Hadoop was the first one out in the open. We had HPCC as closed source, proprietary, for a very long time; at the time we believed it was so core to our competitive advantage that we couldn't afford to release it in any other way. Then we realized that, in reality, the core advantage we have is on one side our data assets and on the other side the high level algorithms, and we knew the platform would be better sustained in the long run (and sustainability is an important factor for us, because the platform is so core to everything we do) by making it open source and free, completely free, not just free as in speech but also free as in beer. We thought that would be the way to ensure long term sustainability, development, expansion, and innovation in the platform itself. But when we did that it was 2011, a few years after Hadoop. Hadoop, if you remember, started as part of another project around web crawling, Nutch, and eventually ended up as a top level Apache project in 2008, I believe. So Hadoop had already been out there for three or four years, and its community was already really large. Over time we did gather a fairly active community, and today we have an active, deeply technical community that not only helps with extending and expanding HPCC, but also provides use cases, sometimes very interesting ones, and uses HPCC regularly. So the HPCC Systems community continues to grow, while the Hadoop community seems to have reached a plateau. Now, there are other communities out there which also handle some of the data management space with their own platforms, like Spark, which I mentioned before and which seems to have a better performance profile than what Hadoop has, so it has also been gathering active people. But I think open source is not a zero sum game where, if one community grows, another one has to shrink and the total number of people stays the same across all of them. I think every new platform that introduces capabilities to the open source community, brings new ideas, and helps apply innovation to those ideas is helping the overall community in general. So it's great to see communities like the Spark community growing, and I think there's an opportunity (many users are in both communities and use both at some point) for each of them to leverage what is done in the others. Sometimes the specific language a platform is coded in creates a bit of a barrier: some of these communities use Java instead of C++ and C, just because Java is more common, and people who are more versed in Java can feel uncomfortable going and trying to understand the code of another platform written in a different language.
0:54:52
But even then, at least
0:54:55
the ideas and the differences in functionality and capabilities can be extracted and adopted across projects, and I think this is good for the overall benefit of everyone. In many cases I see open source as an experimentation playground, where people can bring new ideas, apply those ideas to some code, and then everyone else eventually leverages them, because these ideas percolate across different projects. It's quite interesting. Having been involved personally in open source since the early 90s, I'm quite fond of the process and of how open source works. I think it's beneficial to everyone, in every community.
Tobias Macey
0:55:37
And in terms of the way that you're taking advantage of the HPCC platform at LexisNexis, and some of the ways that you have seen it used elsewhere, I'm wondering what are some of the notable capabilities that you're leveraging, and what are some of the interesting ways that you've seen other people take advantage of it?
Flavio Villanustre
0:55:54
That's a good question, and
0:55:56
the answer might take a little bit longer. At LexisNexis in particular, we certainly use HPCC for almost everything we do, because almost everything we do is data management or data quality in some way. And we have interesting approaches to some of it. We have a number of processes that run on data, and one of those is the probabilistic record linkage process. Probabilistic linkage sometimes requires quite a bit of code to make it work correctly, so there was a point where we were writing all of it in ECL, and it was creating a code base that was getting more and more sizable: larger, bigger, less manageable. So at
0:56:39
some point, we decided
0:56:41
that the level of abstraction in ECL, which is pretty high anyway, wasn't enough for probabilistic data linkage. So we created another language we called SALT, and that language is open source as well, by the way, and still maintained. It is what you would consider a domain specific language for data linkage, probabilistic linkage, and data integration. There is a compiler for SALT that compiles SALT into ECL, the ECL compiler compiles that ECL into C++, and clang or GCC compiles the C++ into assembler. So you can see how the abstraction layers stack up like the layers of an onion, and of course, every time we apply an improvement or optimization in the SALT compiler, or the GCC team applies an optimization, everyone sitting on top of that layer benefits from it, which is quite interesting. We liked this approach so much that we eventually applied it to another problem, which is dealing with graphs. And when I say graphs, I mean social graphs rather than
0:57:46
charts.
0:57:47
So we built yet another language that deals with graphs and machine learning, particularly machine learning on graphs, which is called KEL, the Knowledge Engineering Language. We don't have an open source version of that one, by the way, but we do have a version of the compiler out there for people who want to try it. KEL also generates ECL, and the ECL is compiled down to C++; again, it comes back to the same point. So this is an interesting approach to building abstraction by creating DSLs, domain specific languages, on top of ECL. Another interesting application of HPCC outside of LexisNexis is a company called GuardHat. They make hardhats that are smart: they can do geofencing for workers, and they can detect risky environments in manufacturing or construction settings. They use HPCC, and they use some of the real-time integrations that we have with things like Kafka and CouchDB (the integration work I mentioned between HPCC and other open source projects) to essentially manage all of this data, which is fairly real time,
0:58:57
and create
0:58:58
this real-time analytics, and then real-time machine learning execution for the models they have, along with data integration and even visualization on top of it. And there is more and more; I could go on for days giving you ideas of things that we, or others in the community, have done using HPCC.
Tobias Macey
0:59:21
And in terms of the overall experience that you have had working with HPCC, both on the platform side and as a user of it, what have you found to be some of the most challenging aspects, and what are some of the most useful and interesting lessons that you've learned in the process?
Flavio Villanustre
0:59:38
That is a great question. And
0:59:40
I'll give you a very simple answer, and then I'll explain what I mean. One of the biggest challenges, if you are a new user, is ECL. One of the biggest benefits is also ECL. Unfortunately, not everyone is well versed in declarative programming models. So when you are exposed for the first time to a declarative language that
1:00:04
has immutability, laziness, and
1:00:09
no side effects, it can be a little bit of a brain twister at first. You need to think about problems in a slightly different way to be able to solve them. When you are used to imperative programming, you typically solve a problem by decomposing it into a recipe of steps the computer needs to perform, one by one. When you do declarative programming, you decompose the problem into a set of functions that need to be applied, and you build it up from the ground. It is a slightly different approach, but once you get the idea of how it works it becomes quite powerful, for a number of reasons. First of all, you get to understand the problem better, and you can express the algorithms in a far more succinct way, as just a collection of attributes, where some attributes depend on other attributes that you have defined. It also helps you better encapsulate the components of the problem, so instead of your code becoming a sort of spaghetti that is hard to troubleshoot, it is well encapsulated, both in terms of functionality and in terms of data. If you need to touch anything later on, you can do it safely, without worrying about what some function you are calling might be doing elsewhere, because you know there are no side effects. And, as long as you name your attributes correctly so that people understand what they are supposed to do, you can collaborate more easily with other people as well. After a while I realized, and others have gone through the same experience, that the code I was writing in ECL was, first of all, mostly correct most of the time, which is not what you get with non-declarative programming: you know that if the code compiles, there is a high chance it will run correctly and give you correct results. As I explained before, with a dynamically typed, imperative language with side effects, the code may compile and even run fine a few times, but one day it may throw some sort of runtime error because a type is mismatched, or because a side effect you didn't consider when you re-architected a piece of the code kicks in and makes your results different from what you expected. So again, ECL has been quite a blessing from that standpoint. But of course it does require that you learn this new methodology of programming, which is similar to what someone who knows Python or Java has to learn in order to use SQL, another declarative language, when they are trying to query a database.
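As a loose analogy of the declarative style described above (in Python rather than ECL, purely for illustration; the names are invented), each "attribute" below is a pure function of its inputs with no side effects, so every piece can be understood, tested, and reused on its own:

    def cleaned(records):
        """Normalize raw strings: trim whitespace and lowercase."""
        return [r.strip().lower() for r in records]


    def deduplicated(records):
        """Remove duplicates; depends only on its input, nothing else."""
        return sorted(set(records))


    def summary(records):
        """An 'attribute' built from other attributes, not from shared state."""
        return {"total": len(records), "unique": len(deduplicated(records))}


    if __name__ == "__main__":
        raw = ["  Alice", "bob ", "alice", "BOB"]
        print(summary(cleaned(raw)))  # {'total': 4, 'unique': 2}

Because nothing here mutates shared state, renaming, reordering, or reusing any of these pieces cannot change the behavior of the others, which is the property being described for ECL attributes.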
Tobias Macey
1:03:34
Looking forward, in terms of the medium to long term as well as the near term for the HPCC platform itself, what do you have planned for the future, both in terms of technical capabilities and features, and also as far as community growth and outreach that you'd like to see?
Flavio Villanustre
1:03:53
On the technical capabilities and features side, we tend to keep a community roadmap and we try, as much as we can, to stick to it. There are the big ideas that tend to go into the next or the following major version, the smaller ideas, typically non-disruptive and not breaking backwards compatibility, that go into the minor versions, and then of course the bug fixes.
1:04:23
As many say, they are not bugs but opportunities.
1:04:26
But on
1:04:28
the big ideas side of things, one of the things we've been doing is better integration; as I mentioned before, integration with other open source projects is quite important. We've also been trying to change some of the underlying components in the platform; there are components that we have had for a very, very long
1:04:46
time, like, for example, the
1:04:48
underlying communication layer in Roxie, and we think that may now be ripe for a further revamping by incorporating some of the standard communication frameworks out there. There is also the idea of making the platform far more cloud friendly. It already runs very well in many public clouds, on OpenStack, Amazon, Google, and Azure, but we also want to make the clusters more dynamic. I don't know if you spotted it earlier when I explained how you do data management with Thor and said, well, you have a ten node Thor: what happens when you want to turn that ten node Thor into a 20 or 30 node one, or a five node one? Maybe you have a small process that would work fine with just a couple of nodes, or even one node, and a large process that may need a thousand nodes. Today you can't dynamically resize the Thor cluster. Surely you can resize it by hand and then redistribute the data so that it is spread across the number of nodes you now have, but that is a lot more involved than we would like. In dynamic cloud environments this elasticity becomes quite important, because it is one of the main benefits of the cloud, so making the clusters more elastic and more dynamic is another big goal. Certainly, we continue to develop machine learning capabilities on top of the platform. We have a library of machine learning functions, algorithms, and methods, and we are expanding it. Some of these machine learning methods are quite innovative: one of our core developers, who is also a researcher, developed a new distributed algorithm for K-means clustering which she hadn't seen in the literature before. It is part of a paper and of her PhD dissertation, and it is also part of HPCC now, so people can leverage it. It gives significantly higher scalability to K-means, particularly if you are running on a very large number of nodes. I won't go into the details of how it achieves this better performance, but in short, it moves the data around less: instead it distributes the centers, and it uses the associative property of the main loop of K-means clustering to minimize the number of data records that need to be moved around. That's it from the standpoint of the roadmap and the platform itself.
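The distributed K-means idea sketched above (keep the records where they are and exchange only small, associatively combinable summaries) can be illustrated roughly as follows. This is a generic sketch in Python with NumPy under my own assumptions, not the HPCC Systems implementation or the specific algorithm from the dissertation mentioned here:

    import numpy as np


    def partial_stats(points, centroids):
        """Per-partition step: assign each local point to its nearest centroid
        and return per-cluster (sum, count) summaries. Only these summaries,
        not the data records themselves, need to leave the node."""
        k, dim = centroids.shape
        sums = np.zeros((k, dim))
        counts = np.zeros(k, dtype=int)
        for p in points:
            c = int(np.argmin(((centroids - p) ** 2).sum(axis=1)))
            sums[c] += p
            counts[c] += 1
        return sums, counts


    def combine_and_update(partials, centroids):
        """Merge the partial summaries (an associative reduction) and recompute
        the centroids from the combined sums and counts."""
        total_sums = sum(s for s, _ in partials)
        total_counts = sum(c for _, c in partials)
        new_centroids = centroids.copy()
        for i in range(len(centroids)):
            if total_counts[i] > 0:
                new_centroids[i] = total_sums[i] / total_counts[i]
        return new_centroids


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two partitions, standing in for data that lives on two different nodes.
        partitions = [rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 5.0)]
        centroids = np.array([[0.5, 0.5], [4.5, 4.5]])
        for _ in range(5):
            partials = [partial_stats(part, centroids) for part in partitions]
            centroids = combine_and_update(partials, centroids)
        print(centroids)  # settles near (0, 0) and (5, 5)

Because the per-cluster sums and counts combine associatively, they can be reduced in any order across any number of nodes, which is what keeps the data movement per iteration small.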
On the community side, we continue to try to expand the community as much as we can. One of our core interests (I mentioned that core developer who is also a researcher) is to get more researchers and more of academia onto the platform. We have a number of collaboration initiatives with universities in the US and abroad: Oxford University in the UK, University College London, Humboldt University in Germany, and a number of universities in the US, including Clemson University, Georgia Tech, and Georgia State University. We want to expand that program more, and we also have an internship program. One of the goals we want to achieve with the HPCC Systems open source project is to better balance the community behind it by improving diversity across it: gender diversity, regional diversity, and background diversity. So we are also trying to put quite a bit of emphasis on students, even high school students. We do quite a bit of activity with high schools, on one side trying to get them more into technology, and of course to learn HPCC, but also trying to get more women into technology, and more people who otherwise wouldn't get into technology because they don't get exposed to it at home. That's another core piece of activity in the HPCC community. Last but not least, as part of this diversity effort, there are certain communities that are a little more disadvantaged than others. One of those is people on the autism spectrum. We have been doing quite a bit of work with organizations that help these individuals, trying to enable them through a number of activities, some of which involve training them on the HPCC Systems platform and on data management, to open up opportunities for their lives. Many of these individuals are extremely intelligent, brilliant even, and while they may have other limitations because of their conditions, they can be very valuable contributors, not just for LexisNexis Risk Solutions, ideally, but for other organizations as well.
Tobias Macey
1:09:48
It's great to hear that you have all of these outreach programs for helping to bring more people into technology, both as a means of giving back and as a means of helping to grow your community and contribute to the overall use cases that it empowers. So, for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Flavio Villanustre
1:10:19
I think there are a number of gaps, but the major one is that many of the platforms out there tend to be quite clunky when it comes to integrating things. Unfortunately, I don't think we are mature enough yet. By mature enough, I mean this: if you are a data management person, you know data very well, you know data analytics, you know data processes, but you don't necessarily know operating systems, and you are not a computer scientist who can deal with data partitioning and the computational complexity of algorithms on partitioned data. There are many details that shouldn't be necessary for you to do your job correctly. But unfortunately, given the state of things today, many of these systems, commercial and non-commercial, force you to take care of all of those details, or to assemble a large team of people, from system administrators to network administrators to operating system specialists to middleware specialists, before you can build a system that lets you do your data management. That is something we do try to overcome with HPCC by giving you this homogeneous system that you deploy with a single command and that you can use a minute later, after you have deployed it. I won't say that we are in the ideal situation yet; I think there is still much to improve on. But I think we are a little further along than many of the other options out there. If you know the Hadoop ecosystem, you know how many different components are out there, and if you have done this for a while, you know that one day you realize there is a security vulnerability in one component and you need to update it; but in order to do that you will break compatibility with something else, so now you need to update that other thing too, except there is no update for it, because it depends on yet another component. And this goes on and on and on. So having something that is homogeneous, that doesn't require you to be a computer scientist to deploy and use, and that truly gives you the abstraction layer that you need, which is data management, is missing in many, many systems out there. And I'm not just pointing at open source projects; this applies to commercial products as well. I think it's something that some of the people who design and develop these systems might not understand, because they are not the users. You need to put yourself in the shoes of the user in order to do the right thing; otherwise, whatever you build is pretty difficult to apply, and sometimes useless.
Tobias Macey
1:13:03
Well, thank you very much for taking the time today to join me and describe how HPCC is built and architected, as well as some of the ways that it's being used both inside and outside of LexisNexis. I appreciate all of your time and all the information there. It's definitely a very interesting system, and one that looks to provide a lot of value and capability, so I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
Flavio Villanustre
1:13:30
Thank you very much. I really enjoyed this, and I look forward to doing this again, so one day we'll get together again. Thank you.

Scale Your Analytics On The Clickhouse Data Warehouse - Episode 88

Summary

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Clickhouse is and how you each got involved with it?
    • What are the primary use cases that Clickhouse is targeting?
    • Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
  • Can you describe how Clickhouse is architected?
  • Can you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine and storage layer, and then the lookup process at query time?
    • I noticed that Clickhouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for Clickhouse?
  • Aside from directly inserting a record via the client APIs can you talk through the options for loading data into Clickhouse?
    • For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to Clickhouse?
  • What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to Clickhouse?
  • For someone getting started with Clickhouse can you describe how they should be thinking about data modeling?
  • Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does Clickhouse fall along that spectrum?
  • How is scaling in Clickhouse implemented and what are the edge cases that users should be aware of?
    • How is data replication and consistency managed?
  • What is involved in deploying and maintaining an installation of Clickhouse?
    • I noticed that Altinity is providing a Kubernetes operator for Clickhouse. What are the opportunities and tradeoffs presented by that platform for Clickhouse?
  • What are some of the most interesting/unexpected/innovative ways that you have seen Clickhouse used?
  • What are some of the most challenging aspects of working on Clickhouse itself, and or implementing systems on top of it?
  • What are the shortcomings of Clickhouse and how do you address them at Altinity?
  • When is Clickhouse the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Managing The Machine Learning Lifecycle - Episode 84

Summary

Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Hydrosphere is and share its origin story?
  • In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
    • How does it differ from deployment and maintenance of a regular software application?
  • Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
  • For someone who is using Hydrosphere in their production workflow, what would that look like?
    • What is the difference in interaction with Hydrosphere for different roles within a data team?
  • What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
    • Which metrics do you track for testing and verifying the health of the data?
  • What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
  • How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
    • How has that influenced the design and direction of Hydrosphere, both as a project and a business?
    • How has the design of Hydrosphere evolved since you first began working on it?
  • What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
  • What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
  • What do you have in store for the future of Hydrosphere?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Lineage For Your Pipelines - Episode 82

Summary

Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pachyderm is and how it got started?
    • What is new in the last two years since I talked to Dan Whitenack in episode 1?
    • How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
  • A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
  • Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
    • How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
  • There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
  • Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
    • What is the interface for exposing and exploring that provenance data?
  • What are some of the advanced capabilities of Pachyderm that you would like to call out?
  • With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
  • What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
  • What do you have planned for the future of Pachyderm?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Build Your Data Analytics Like An Engineer - Episode 81

Summary

In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what DBT is and your motivation for creating it?
  • Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?
  • Can you talk through the workflow for someone using DBT?
  • One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?
  • The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?
    • Are these packages driven by Fishtown Analytics or the dbt community?
  • What are the limitations of modeling everything as a SELECT statement?
  • Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?
    • What are your thoughts on higher level approaches to SQL that compile down to the specific statements?
  • Can you explain how DBT is implemented and how the design has evolved since you first began working on it?
  • What are some of the features of DBT that are often overlooked which you find particularly useful?
  • What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?
  • What are the additional features that the commercial version of DBT provides?
  • What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?
  • When is it the wrong choice?
  • What do you have planned for the future of DBT?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Using FoundationDB As The Bedrock For Your Distributed Systems - Episode 80

Summary

The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key/value store that gives you the power of ACID transactions in a NoSQL database

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you explain what FoundationDB is and how you got involved with the project?
  • What are some of the unique use cases that FoundationDB enables?
  • Can you describe how FoundationDB is architected?
    • How is the ACID compliance implemented at the cluster level?
  • What are some of the mechanisms built into FoundationDB that contribute to its fault tolerance?
    • How are conflicts managed?
  • FoundationDB has an interesting feature in the form of Layers that provide different semantics on the underlying storage. Can you describe how that is implemented and some of the interesting layers that are available?
    • Is it possible to apply different layers, such as relational and document, to the same underlying objects in storage?
  • One of the aspects of FoundationDB that is called out in the documentation and which I have heard about elsewhere is the performance that it provides. Can you describe some of the implementation mechanics of FoundationDB that allow it to provide such high throughput?
  • For someone who wants to run FoundationDB can you describe a typical deployment topology?
    • What are the scaling factors for the underlying storage and for the Layers that are operating on the cluster?
  • Once you have a cluster deployed, what are some of the edge cases that users should watch out for?
    • How are version upgrades managed in a cluster?
  • What are some of the ways that FoundationDB impacts the way that an application developer or data engineer would architect their software as compared to working with something like Postgres or MongoDB?
  • What are some of the more interesting/unusual/unexpected ways that you have seen FoundationDB used?
  • When is FoundationDB the wrong choice?
  • What is in store for the future of FoundationDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA