
Navigating Boundless Data Streams With The Swim Kernel - Episode 98

Summary

The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately, this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye-opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Swim.ai is and how the project and business got started?
    • Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
  • What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
  • How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
  • Can you describe a typical design for an application or system being built on top of the Swim platform?
    • What does the developer workflow look like?
      • What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
  • Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
  • For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
    • What mechanisms are in place to account for network failures?
  • Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
  • Since there is no explicit data layer, how is data redundancy handled by Swim applications?
  • What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
  • What have you found to be the most challenging aspects of building the Swim platform?
  • What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
  • What do you have planned for the future of the technical and business aspects of Swim.ai?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Simon Crosby about Swim.ai, the data fabric for the distributed enterprise. So Simon, can you start by introducing yourself?
Simon Crosby
0:02:28
Hi, I'm Simon Crosby. I am the CTO — I guess, of long duration; I've been around for a long time. And it's a privilege to be with the Swim folks, who have been building this fabulous platform for streaming data for about five years.
Tobias Macey
0:02:49
And do you remember how you first got involved in the area of data management?
Simon Crosby
0:02:53
Well, I have a PhD in applied mathematics and probability, so I am kind of not a data management guy, I'm an analysis guy. I like what comes out of, you know, streams of data and what inference you can draw from it. So my background is more on the analytical side, and then along the way I began to see how to build big infrastructure for it.
Tobias Macey
0:03:22
And now you have taken up the position as CTO for swim.ai, I'm wondering if you can explain a bit about what the platform is and how the overall project and business got started?
Simon Crosby
0:03:33
Sure. So here's the problem. We're all reading all the time about these wonderful things that you can do with machine learning, and streaming data, and so on; it all involves cloud and other magical things. And in general, most organizations just don't know how to make head or tail of that, for a bunch of reasons; it's just too hard to get there. So if you're an organization with assets that are shipping out lots of data — and that could be a bunch of different types — you know, you probably don't have the skill set in house to deal with a vast amount of information. And we're talking about boundless data sources here, things that never stop showing up. And so to deal with the data flow pipelines, to deal with the data itself, to deal with the learning and inferences you might draw from that, and so on, enterprises face a huge skill set challenge. There is also a cost challenge, because today's techniques related to drawing inference from data in general revolve around, you know, large, expensive data lakes, either in house or perhaps in the cloud. And then finally, there's a challenge with the timeliness within which you can draw an insight. And most folks today believe that you store data, and then you think about it in some magical way, and you draw inference from that. And we're all suffering from the Hadoop-Cloudera, I guess, after effects, and really, this notion of storing and then analyzing needs to be dispensed with for fast data; certainly for boundless data sources that will never stop, it's really inappropriate. So when I talk about boundless data today, we're going to talk about data streams that just never stop, and about the need to derive insights from that data on the fly, because if you don't, something will go wrong. So it's of the type that would stop your car before you hit the pedestrian in the crosswalk, that kind of stuff. So for that kind of data, there's just no chance to, you know, store it all down to a hard disk first.
Tobias Macey
0:06:16
And how would you differentiate the work that you're doing with the Swim.ai platform and the SwimOS kernel from things that are being done with tools such as Flink, or other streaming systems such as Kafka, which now has capabilities for being able to do some limited streaming analysis on the data as it flows through, or also platforms such as Wallaroo that are built for being able to do stateful computations on data streams?
Simon Crosby
0:06:44
So first of all, there have been some major steps forward, and anything we do, we stand on the shoulders of giants. Let's start off with distinguishing between the large enterprise skill set that's out there, and the cloud world. And all the things you mentioned live in the cloud world. So with that as a reference distinction, most people in the enterprise, when you said Flink, wouldn't know what the hell you were talking about. Okay? Similarly Wallaroo or anything else — they just wouldn't know what you're talking about. And so there is a major problem with the tools and technologies that are built for the cloud, really for cloud native applications, and the majority of enterprises, who are just stuck with legacy IT and application skill sets, and are still coming up to speed with the right thing to do. And to be honest, they're getting over the headache of Hadoop. So then, if we talk about the cloud native world, there is a fascinating distinction between all the various projects which have started to tackle streaming data. And there has been some major progress made there — Swim being one of them, I'd be delighted to point out — and we can go into each one of those projects in detail as we go forward. The key point being that, first and foremost, the large majority of enterprises just don't know what to do.
Tobias Macey
0:08:22
And then within your specific offerings, there is the Data Fabric platform, which you're targeting for enterprise consumers, and then there's also the open source kernel of that in the form of SwimOS. I'm wondering if you can provide some explanation as to what are the differentiating factors between those two products and the sort of decision points for when somebody might want to use one versus the other?
Simon Crosby
0:08:50
Yeah, let's cut it first at the distinction between the application layer and the infrastructure needed to run a large distributed data flow pipeline. And so for Swim, all of the application layer stuff — everything you need to build an app — is entirely open source. Some of the capabilities that you want to run a large distributed data pipeline are proprietary. And that's really just because, you know, we're building a business around this; we plan to open source more and more features over time.
Tobias Macey
0:09:29
And then as far as the primary use cases that you are enabling with the Swim platform, and some of the different ways that enterprise organizations are implementing it, what are some of the cases where, using something other than Swim — either the OS or the Data Fabric layer — the problem would be either impractical or intractable if they were trying to use more traditional approaches such as Hadoop, as you mentioned, or a data warehouse and more batch oriented workflows?
Simon Crosby
0:09:58
So let's start off describing what Swim does — can I do that? That might help. In our view, it's our job to build the pipeline, and indeed the model, from the data. Okay, so Swim just wants data, and from the data we will automatically build this typical data flow pipeline. And indeed, from that, we will build a model of arbitrarily interesting complexity, which allows us to solve some very interesting problems. Okay. So the Swim perspective starts with data, because that's where our customers' journey starts. They have lots and lots of data; they don't know what to do with it. And so the approach we take in Swim is to allow the data to build the model. Now, you would naturally say that's impossible in general, but what it requires is some ontology at the edge, which describes the data — you could think of it as a schema, in fact — basically to describe what data items mean, in some sort of useful sense to us as modelers. But then, given data, Swim will build that model. So let me give you an example. Given a relatively simple ontology for traffic and traffic equipment — the pedestrian lights, the loops in the road, the lights and so on — Swim will build a model in which there is a stateful digital twin for every sensor, for every source of data, which runs concurrently in some distributed fabric, processes its own raw data, and statefully evolves, okay. So simply given that ontology, Swim knows how to build stateful, concurrent little things we call web agents — actually, I'm using that term,
0:12:18
I guess the same as digital twin.
0:12:21
And these are concurrent things which are going to statefully process raw data and represent that in a meaningful way. And the cool thing about that is that each one of these little digital twins exists in a context — a real world context — that Swim is going to discover for us. So for example, an intersection might have 60 to 80 sensors; so there's this notion of containment. But also, intersections are adjacent to other intersections in the real world map, and so on. That notion of adjacency is also a real world relationship. And in Swim, this notion of a link allows us to express the real world relationships between these little digital twins. And linking in Swim has this wonderful additional property, which is to allow us to express — essentially, in Swim, there is never a pub, but there is a sub. And if something links to something else — say I link to you — then it's like LinkedIn for things: I get to see the real time updates of the in-memory state held by that digital twin. So digital twins link to digital twins courtesy of real world relationships, such as containment or proximity. We can even do other relationships, like correlation,
0:14:05
so things also link to each other, which allows them to share data.
0:14:09
And sharing data allows interesting computational properties to be derived. For example, we can learn and predict. Okay, so job one is to define the ontology; Swim goes and builds a graph — a graph of digital twins — which is constructed entirely from the data. And then the linking happens as part of that. And that allows us to then construct interesting computations.
0:14:45
Is that useful?
Tobias Macey
0:14:46
Yes, that's definitely helpful to get an idea of some of the use cases and some of the ways that the different concepts within Swim work together, and of what a sort of conceptual architecture would be for an application that would utilize Swim.
Simon Crosby
0:15:03
So the key thing here — I'm talking about an application, as I just said — the application is to predict the future: the future traffic in a city, or what's going to happen in a traffic area right now. Now, I could do that for a bunch of different cities. What I can tell you is I need a model for each city. And there are two ways to build a model. One way is I get a data scientist and have them build it, or maybe they train it, and a whole bunch of other things, and I'm going to have to do this for every single city where I want to use this application. The other way to do it is to build the model from the data. And that's the approach. So what Swim does is, simply given the ontology, build these little digital twins, which are representatives of the real world things, get them to statefully evolve, and then link them to other things, you know, to represent real world relationships. And then suddenly, hey presto, you have built a large graph, which is effectively the model that you would otherwise have had to have a human build, right? So it's constructed, in the sense that in any new city you go to, this thing is just going to unbundle and, just given a stream of data, it will build a model which represents the things that are the sources of data and their physical relationships. Does that make sense?
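As a rough illustration of the stateful digital twin idea described here — this is not the actual SwimOS API, and every class and method name below is invented for the sketch — a twin for one intersection might look something like the following Java, evolving its own in-memory state from raw sensor messages and pushing changes to the twins it is linked to:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a stateful "digital twin" for a single intersection.
// SwimOS web agents provide this kind of behavior through lanes and links;
// the names and structure here are invented for explanation.
public class IntersectionTwin {
    private final String uri;                  // identity of this twin, e.g. "/intersection/123"
    private boolean lightIsRed;                // semantic state derived from raw signals
    private final List<IntersectionTwin> adjacent = new ArrayList<>();  // real-world adjacency links

    public IntersectionTwin(String uri) {
        this.uri = uri;
    }

    // Called for every raw sensor message; the twin statefully evolves in memory.
    public void onSensorMessage(String sensorId, double value) {
        if (sensorId.endsWith(":signal")) {
            lightIsRed = value > 0.5;          // assumed encoding, purely for illustration
            notifyLinked();                    // stream the change to everything linked to us
        }
    }

    // Linking expresses a real-world relationship such as adjacency or containment.
    public void linkTo(IntersectionTwin other) {
        adjacent.add(other);
    }

    private void notifyLinked() {
        for (IntersectionTwin neighbor : adjacent) {
            neighbor.onNeighborUpdate(uri, lightIsRed);
        }
    }

    // A linked twin sees state updates as they happen, instead of polling a database.
    public void onNeighborUpdate(String neighborUri, boolean neighborLightIsRed) {
        // e.g. feed a local prediction from the current states of neighboring intersections
    }
}
```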
Tobias Macey
0:16:38
Yeah, and I'm wondering if you can expand upon that in terms of the type of workflow that a developer who is building an application on top of Swim would go through, as far as identifying what those ontologies are and defining how the links will occur as the data streams into the different nodes in the Swim graph.
Simon Crosby
0:17:01
So the key point here is that we think we can build, like, 80% of an app, okay, from the data. That is, we can find all of the big structural properties of relevance in the data, and then let the application builder drop in what they want to compute. And so let me try and express this slightly differently. Job one, we believe, is to build a model of the stateful digital twins, which almost mirror their real world counterparts. So at all points in time, their job is to represent the real world, as faithfully and as close to real time as they can, in a stateful way which is relevant to the problem at hand. Okay, so the real world light is red, so I'm going to have a red light, okay, something like that. And the first problem is to build these essential digital twins, which are interlinked, which represent the real world things, okay. And it's important to separate that from the application layer component of what you want to compute from that. So frequently we see people making the wrong decision, that is, hard coupling the notion of prediction, or learning, or any other form of analysis into the application in such a way that any change requires programming. And we think that that's wrong. So job one is to have this faithful representation of a real time world in which everything evolves its own state whenever its real world twin evolves, and evolves statefully. And then the second component to that, which we do on a separate timescale, is to inject operators which are going to then compute on the states of those things at the edge, right. So we have a model which represents the relationships between things in the real world. It's attempting to evolve as close as possible to real time in relationship to the real world twin, and it's reflecting its links and so on. But the notion of what you want to compute from it is separate from that and decoupled. And so the second step, which is an application — or building an application right here, right now — is to drop in an operator which is going to compute a thing from that. So you might say, cool, I want every intersection to compute, you know, to be able to learn from its own behavior and predict; that's one thing. We might say, I want to compute the average wait time of every car in the city; that's another thing. So the key point here is that computing from these rapidly evolving worldviews is decoupled from the actual model of what's going on in that world at a point in time. So Swim reflects that decoupling by allowing you to bind operators to the model whenever you want.
0:20:45
Okay,
0:20:46
By 'whenever you want', I mean you can write them in code, in bits of Java or whatever. But also, you can write them in blobs of JavaScript or Python, and dynamically insert them into a running model. Okay, so let me make that one concrete for you. I could have a deployed system, which is a model — a deployed graph of digital twins — which is currently mirroring the state of Las Vegas. And then dynamically, a data scientist says, let me compute the average wait time of red cars at these intersections, and drops that in as a blob of JavaScript attached to every digital twin for an intersection. That is what I mean by an application. And so we want to get to this point where the notion of an application is not something deeply hidden in somebody's, you know, Jupyter notebook, or in some programmer's brain — and they quit and wandered off to the next startup ten months ago — an application is whatever anyone wants, right now, to drop into a running model.
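A minimal sketch of that decoupling, assuming a purely hypothetical operator interface (this is not how SwimOS actually binds Java, JavaScript, or Python operators): the model keeps running, and an operator is attached to a twin at runtime and sees every subsequent observation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative sketch only: operators are bound to a running twin, decoupled from the model.
public class OperatorHost {
    // Operators keyed by name; each receives (carColor, waitMillis) observations.
    private final Map<String, BiConsumer<String, Long>> operators = new HashMap<>();

    // A data scientist can drop an operator in while the model keeps running.
    public void attach(String name, BiConsumer<String, Long> operator) {
        operators.put(name, operator);
    }

    // The twin calls this as its own state evolves; every attached operator sees the update.
    public void onCarCleared(String color, long waitMillis) {
        operators.values().forEach(op -> op.accept(color, waitMillis));
    }
}
```

An "average wait time of red cars" computation would then be just such an attached operator, added or removed without redeploying anything.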
Tobias Macey
0:22:02
So the way that sounds to me is that Swim essentially acts as the infrastructure layer you deploy to ingest the data feeds from the sets of sensors, and then it will automatically create these digital twin objects to be able to have some digital manifestation of the real world, so that you have a continuous stream of data and how that's interrelated. And then it sort of flips the order of operations in terms of how the data engineer and the data scientist might work together. Where the way that most people are used to, you will ingest the data from these different sensors, bundle it up, and then hand it off to a data scientist to be able to do their analyses; they generate a model and then hand it back to the data engineer to say, okay, go ahead and deploy this and then see what the outputs are. Where instead, the Swim platform essentially acts as the delivery mechanism and the interactive environment for the data scientist to be able to experiment with the data, build a model, and then get it deployed on top of the continuously updating live stream of data, and then be able to have some real world interaction with those sensors, in real time, as they're doing that — to be able to feed that back to say, okay, red cars are waiting 15% longer than other cars at these two intersections, and I want to be able to optimize our overall grid, and that will then feed back into the rest of the network to have some physical manifestation of the analysis that they're trying to perform, to try and maybe optimize all traffic.
Simon Crosby
0:23:39
So there are some consequences to that. First of all, every algorithm has to compute stuff on the fly. So if you look at, you know, the kind of store-and-then-analyze approach to big data type learning, or training or anything else — you know, there you have all the data; here, you don't. And so every algorithm that is part of Swim is coded in such a way as to continually process data. And that's fundamentally different to most frameworks. Okay, so for example,
0:24:19
the,
0:24:21
the learn and predict cycle is what, you know — you mentioned training, and so on, and that's very interesting. But that implies that I collect and store some training data, and that it's complete and useful enough to train the model and then hand it back. You know, what if it isn't? And so in Swim we don't do that. I mean, we can if you want — if you have a model, it's no problem for us to use that too. But instead, in Swim, the input vector, say to a prediction or a learning algorithm, is precisely the current state of the digital twins for some bunch of things, right? Maybe the set of sensors in the neighborhood of the urban intersection. And so this is a continuously varying, real world triggered scenario in which real data is fed through the algorithm, but is not stored anywhere. So everything is fundamentally streaming. So we assume that data streams continually, and indeed the output of every algorithm streams continually. So what you see when you compute the average is the current average. Okay? What you see when you're looking for heavy hitters is the current heavy hitters. All right. And so every algorithm has its streaming twin, I guess. And part of the art in the Swim context is reformulating the notion of analysis into a streaming context, so that you never expect a complete answer, because there isn't one — it's just what I've seen until now. Okay, and what I've seen until now has been fed through the algorithm; this is the current answer. And so every algorithm computes and streams. And so the notion of linking, which I described earlier for Swim between digital twins, say, applies also to these operators, which effectively link to the things they want to compute from, and then they stream their results. Okay, so if you link in, you see a continual update. And, for example, that stream could be used to feed a Kafka implementation, which would serve a bunch of applications — you know, the notion of streaming is pretty well understood, so we can feed other bits of the infrastructure very well. But fundamentally, everything is designed to stream.
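For example, the streaming reformulation of something as basic as an average keeps no history at all; each observation is folded into the current answer and then discarded. A minimal illustrative Java sketch:

```java
// Illustrative sketch only: a streaming average — there is never a final answer,
// only the current answer over whatever has been seen so far.
public class StreamingMean {
    private long count = 0;
    private double mean = 0.0;

    // Incremental update: no samples are stored; each value is folded in and dropped.
    public void observe(double value) {
        count++;
        mean += (value - mean) / count;
    }

    // Linked consumers would see this value stream as it changes.
    public double current() {
        return mean;
    }
}
```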
Tobias Macey
0:27:21
It's definitely an interesting approach to the overall workflow of how these analyses work. And one thing that I'm curious about is how data scientists and analysts have found working with this platform, in terms of the ways that they might be used to working — you know, I'm interested in how data scientists view this.
Simon Crosby
0:27:45
To be honest, in general, with surprise.
0:27:50
Our experience to date has been largely with people who don't know what the heck they're doing in terms of data science. So they're trying to run an oil rig more efficiently; they have, what, about 10,000 sensors, and they want to make sure this thing isn't going to blow up. Okay? So these tend to be heavily operationally focused folks. They're not data scientists — they never could afford one — and they don't understand the language of data science, or have the ability to build cloud based pipelines that you and I might be familiar with. So these are folks who effectively just want to do a better job, given this enormous stream of data they have; they believe they have something in the data, they don't know what that might be, but they're keen to go and see. Okay. And so those are the folks we spend most of our time with. I'll give you a funny example, if you'd like.
Tobias Macey
0:29:00
Sure, if it's illustrative.
Simon Crosby
0:29:02
we work with a manufacturer of aircraft.
0:29:05
And they have a very large number of RFID tagged parts, and equipment too. And so if you know anything about RFID, you know it's pretty useless stuff — it was built about 10 or 20 years ago. And so what they were doing is, from about 2,000 readers, they get about 10,000 reads a second, and each one of these reads was simply being written into an Oracle database, and at the end of the day they try to reconcile all of that with whatever parts they have, and where everything is, and so on. And the Swim solution to this is entirely different — it gives you a good idea of why we care about modeling data, or thinking about data, differently. We simply build a digital twin for every tag: the first time it's seen, we create one, and if they haven't been seen for a long time, they just expire. And whenever a reader sees a tag, it simply says, hey, I saw you, and this was the signal strength. Now, because tags get seen by multiple readers, each digital twin of a tag does the obvious thing: it triangulates from the readers. Okay, so it learns the attenuation in different parts of the plant — it's very simple; the word 'learn' there is a bit of a stretch, it's a pretty straightforward calculation — and then suddenly it can work out where it is in 3-space. So instead of an Oracle database full of tag reads, and lots and lots of post processing, you know, you have a couple of Raspberry Pis, and on these Raspberry Pis you have millions of these tags running, and then you can ask any one of them where it is. Okay, and then you can do even more: you can say, hey, show me all the things within three meters of this tag, okay? And that allows you to see components being put together into real physical objects, right? So as a fuselage gets built up, or the engine or whatever it is. And so a problem which was tons of infrastructure and tons of tag reads came down to a couple of Raspberry Pis with stuff which kind of self organized into a form which could feed real time visualization and controls around where bits of infrastructure were.
0:31:52
Okay. Now, that
0:31:54
was transformative for this outfit, which quite literally had no way of tackling the problem this way before. Does that make sense?
Tobias Macey
0:32:02
Yeah, that's definitely a very useful example of how this technology can flip the overall order of operations and expand the overall capabilities of an organization to be able to answer useful questions. And the idea of going from, as you said, an Oracle database full of just rows and rows of records of 'this tag read at this point, in this location', and then being able to actually get something meaningful out of it — as far as 'this part is in this location in the overall reference space of the warehouse' — is definitely transformative, and probably gave them weeks or months worth of additional lead time for being able to predict problems or identify areas for potential optimization.
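The per-tag location step described above can be pictured as a small calculation inside each tag's twin. The Java sketch below is illustrative only and rests on assumptions the episode does not spell out — known reader positions, and signal strength used directly as a weight — but it conveys the flavor of estimating a position in 3-space from the readers that currently see a tag, here via a simple weighted centroid:

```java
import java.util.List;

// Illustrative sketch only: estimate a tag's position from the readers that see it.
public class TagLocator {

    public static class Sighting {
        final double x, y, z;          // known position of the reader that saw the tag
        final double signalStrength;   // stronger signal suggests the reader is closer

        public Sighting(double x, double y, double z, double signalStrength) {
            this.x = x;
            this.y = y;
            this.z = z;
            this.signalStrength = signalStrength;
        }
    }

    // Weighted centroid of the sighting readers, weighted by signal strength.
    public static double[] estimate(List<Sighting> sightings) {
        double weightSum = 0, x = 0, y = 0, z = 0;
        for (Sighting s : sightings) {
            weightSum += s.signalStrength;
            x += s.x * s.signalStrength;
            y += s.y * s.signalStrength;
            z += s.z * s.signalStrength;
        }
        if (weightSum == 0) {
            throw new IllegalArgumentException("no usable sightings");
        }
        return new double[] { x / weightSum, y / weightSum, z / weightSum };
    }
}
```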
Simon Crosby
0:32:47
Yeah, I think we saved them $2 million a year. Let me tell you, from this tale come two interesting things. First of all, if you show up at a customer site running on a Raspberry Pi, you can't charge them a million bucks. Okay, that's lesson one. Lesson two is that the volume of the data is not relevant, or not related, to the value of the insight. Okay? I mentioned traffic earlier. In the city of Las Vegas, we get about 15 to 16 terabytes per day from the traffic infrastructure, and the digital twin of every intersection in the city predicts two minutes into the future, okay. And those insights are sold through an API in Azure to customers like Audi and Uber and Lyft and whoever else, okay. Now, that's a ton of data, okay — you couldn't even think of where to put it in your cloud — but the value of the insight is relatively low. That is, the total amount of money I can extract from Uber per month per intersection is low. All right. By the way, all this stuff is open source; you can go grab it and play, and hopefully make your city better. So the point from that is, it's not a high enough value for me to do anything other than say, go grab it and run. So: vast amounts of data, and relatively important, but not commercially significant, value.
Tobias Macey
0:34:35
And another aspect of that case in particular is that, despite this volume of data, it might be interesting for being able to do historical analyses, but in terms of the actual real world utility, it has a distinct expiration period, where you have no real interest in the sensor data as it existed an hour ago, because that has no particular relevance to your current state of the world and what you're trying to do with it at this point in time.
Simon Crosby
0:35:03
Yeah, you have historical interest in the sense of wanting to know if your predictions were right, or for traffic engineering purposes, which run on a slower time scale. So some form of bucketing, or some similar coarser form of recording, is useful — and sure, that's easy. But you certainly do not want to record it at the full data rate.
Tobias Macey
0:35:30
And then going back to the other question I had earlier, when we were talking about the workflow of an analyst or a data scientist pushing out their analyses live to these digital twins and potentially having some real world impact: I'm curious if the Swim platform has some concept of a dry run mode, where you can deploy an analysis and see what its output is, and maybe what impact it would have, without it actually manifesting in the real world — for cases where you want to ensure that you're not accidentally introducing error or potentially having a dangerous outcome, particularly in the case that you were mentioning of an oil and gas rig.
Simon Crosby
0:36:12
Yeah, so with maybe a one percent exception, everything we've done thus far has been open loop, in the sense that we're informing another human or another application, but we're not directly controlling the infrastructure. And the value of a dry run would be enormous, you can imagine, in those scenarios, but thus far we don't have any use cases that we can report of using Swim for direct control. We do have use cases where, on a second by second basis, we are predicting whether machines are going to make an error as they build PCBs for servers, and so on. But again, what you're doing there is calling for somebody to come over and fix the machine; you're not, you know, trying to change the way the machine behaves.
Tobias Macey
0:37:06
And now digging a bit deeper into the actual implementation of swim, I'm wondering if you can talk through how the actual system itself is architected. And some of the ways that it has evolved as you have worked with different partners to deploy it into real world environments and get feedback from them, and how that has affected the overall direction of the product roadmap.
Simon Crosby
0:37:29
So Swim is a couple of megabytes of Java, essentially. Okay? So it's extremely lean. We tend to deploy in containers using the GraalVM, so it's very small — we can run in, you know, probably 100 megabytes or so. And so, when people tend to think of edge, they tend to think of running in the gateways or things; we don't really think of edge that way. An important part of defining edge, as far as we're concerned, is simply gaining access to streaming data — we don't really care where it is — but being small enough to run on limited amounts of compute towards the physical edge. And the, you know, the product has evolved in the sense that originally it was a way of building applications for the edge, and you'd sit down, write them in Java, and so on.
0:38:34
Latterly, this ability to simply
0:38:39
let the data build the app, or most of the app, came about in response
0:38:46
to customer needs.
0:38:49
But Swim is deployed typically in containers, and for that we have, in the current release, relied very heavily on the Azure IoT Edge framework. And that is magical, to be quite honest, because we can rely on Microsoft machinery to deal with all of the painful bits of deployment and lifecycle management for the code base and the application as it runs. These are not things we are really focused on; what we're trying to do is build a capability which will respond to data and do the right thing for the application developer. And so we are fully published in the Azure IoT Hub, and you can download this and get going and manage it through its lifecycle that way. And so in several use cases now, what we're doing is we are used to feed fast time scale insights at the physical edge, we are labeling data and then dropping it into Azure ADLS Gen2, and feeding insights into applications built in Power BI. Okay, so that's just for the sake of machinery, you know, using the Azure framework for management of the IoT edge. By the way, I think IoT Edge is about the worst possible name you could ever pick, because all you want is a thing to manage the lifecycle of a capability which is going to deal with fast data — whether it's at the physical edge or not is immaterial. But that's basically what we've been doing: relying on Microsoft's fabulous lifecycle management framework for that, plugged into the IoT Hub, and all the Azure services generally, for back end things which enterprises love.
Tobias Macey
0:41:00
Then another element of what we're discussing, in the use case examples that you were describing — particularly, for instance, with the traffic intersections — is the idea of discoverability and routing between these digital twins: how they identify the cardinality of which twins are useful to communicate with and establish those links; and also, at the networking layer, how they handle network failures in terms of communication and ensure that if there is some sort of fault they're able to recover from it.
Simon Crosby
0:41:36
So let's talk about two layers. One is the app layer, and the other one is the infrastructure which is going to run this, effectively, distributed graph.
0:41:45
And so Swim is going to build this graph for us
0:41:49
from the data. What that means is the digital twin, by the way, we technically call these web agents, these little web agents are going to be distributed somewhere a fabric of physical instances, and they may be widely geographically
0:42:06
distributed. And
0:42:08
so there is a need, nonetheless, at the application layer for things which are related in some way — linked physically or, you know, in some other way — to be able to link to each other; that is, to
0:42:23
have a sub. And so links
0:42:27
require that objects, which are the digital twins, have the ability to inspect
0:42:33
each other's data,
0:42:34
right, their members. And of course, if something is running on the other side of the planet and you're linked to it, how on earth is that going to work? We're all familiar with object oriented languages and objects in one address space — that's pretty easy. We know what an
0:42:50
object handle or an object
0:42:51
reference or a pointer or whatever is — we get it — but when these things are distributed, that's hard. And so in Swim, if you're an application programmer, you will simply use object references, but these resolve to URIs. So in practice, at runtime, the linking is: when I link to you, I'll link to your URI. And that link,
0:43:17
which is resolved by Swim,
0:43:19
enables a continuous stream of updates to flow from you to me. And if we happen to be on different instances — that is, running in different address spaces — then that will happen over a mesh of, you know, direct WebSocket connections between your instance and mine. And so in any Swim deployment, all instances are interlinked: they each link to each other using a single WebSocket connection, and then these links permit the flow of information between linked digital twins. And what happens is, whenever a change in the in-memory state of a linked, you know, digital twin happens, its instance then streams to every other linked object an update to the state for that thing, right. So what's required is, in effect, a streaming update to JSON — because we're going to record our model in some form of, like, JSON state or whatever, we need to be able to update little bits of it as things change — and we use a protocol called WARP for that. And that's a Swim capability which we've open sourced. And what that really does is bring streaming to JSON, right — streaming updates to parts of a JSON model. And then every instance in Swim maintains its own view of the whole model. So as things stream in, the local view of the model changes. But the view of the world is very much one of a consistency model based on whatever happens to be executing locally and whatever it needs to view — it's an eventually consistent data model, in which every node eventually learns the entire thing.
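Conceptually, each instance keeps a local, eventually consistent view of the model by applying the small, path-level updates streamed over its links. The Java sketch below is illustrative only — it does not implement the actual WARP protocol or its framing — but it captures the shape of the idea:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: an eventually consistent local view of a JSON-like model,
// maintained from streamed path-level updates rather than request/response polling.
public class LocalModelView {
    // Flattened view of the model, e.g. "intersection/123/lightIsRed" -> "true".
    private final Map<String, String> state = new ConcurrentHashMap<>();

    // Each linked peer streams small updates as its in-memory state changes.
    public void onUpdate(String path, String value) {
        state.put(path, value);
    }

    // Reads are local and cheap; they reflect everything streamed so far,
    // i.e. a view roughly one link's delay away from real time.
    public String read(String path) {
        return state.get(path);
    }
}
```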
Tobias Macey
0:45:22
And generally, 'eventually' here means, you know, a link's delay away from real time. And then the other aspect of the platform is the statefulness of the computation, and as you're saying, that state is eventually consistent, dependent on the communication delay between the different nodes within the context graph. And then in terms of data durability, one thing I'm curious about is the length of state, or sort of the overall buffer, that is available — I'm guessing that is largely dependent on where it happens to be deployed and what the physical capabilities are of the particular node. And then also, as far as persisting that data for maybe historical analysis, my guess is that that relies on distributing the data to some other system for long term storage. I'm just wondering what the overall sort of pattern or paradigm is for people who want to be able to have that capability.
Simon Crosby
0:46:24
Oh, this is a great question. So in general, we move from some horrific raw data form on the wire, from the original physical thing, to, you know, something much more efficient and meaningful in memory, and generally much more concise — so we get a whole ton of data reduction on the way in. And the system is focused on streaming: we don't stop you storing your original data if you want to, you might just have to store it to disk or whatever; the key thing in Swim is we don't do that on the hot path. Okay, so things change their state in memory, and maybe compute on that — that's what they do first and foremost — and then we lazily throw things to disk, because disk happens slowly relative to compute. And so typically, what we end up storing is the semantic state of the context graph, as you put it, not the original data.
0:47:23
That is, for example, in traffic world,
0:47:26
you know, we store things like 'this light turned red at this particular time', not the voltage on all the registers in the light, and so we get massive data reduction. And that form of data is very amenable to storage in the cloud, say, or somewhere else, and it's even affordable at reasonable rates.
0:47:50
So the key thing for Swim and storage is:
0:47:53
you're going to remember as much as you want — as much as you have space for — locally. And then storage in general is not on the hot path; it's not on the compute-and-stream path, and in general we're getting huge data reductions for every step up the graph we make. So, for example, if I go from, you know, all the states of all the traffic sensors to predictions, then I've made a very substantial reduction in the data I need to keep anyway, right. So as you move up this computational graph, you reduce the amount of data you're going to have to store, and it's really up to you to pick what you want to keep.
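The reduction described here can be thought of as turning raw samples into semantic events on the hot path and persisting only state changes, lazily. A small illustrative Java sketch (the threshold and the storage call are assumptions for the example, not part of Swim):

```java
// Illustrative sketch only: persist semantic state changes ("the light turned red at time T"),
// not the raw samples ("the voltage on every register, every few milliseconds").
public class SemanticRecorder {
    private Boolean lastLightIsRed = null;

    // Called on the hot path for every raw sample; nothing is written unless the state changes.
    public void onRawSample(long timestampMillis, double signalVoltage) {
        boolean lightIsRed = signalVoltage > 2.5;   // assumed threshold, purely for illustration
        if (lastLightIsRed == null || lightIsRed != lastLightIsRed) {
            lastLightIsRed = lightIsRed;
            persist(timestampMillis, lightIsRed);   // lazy write to cheap long-term storage
        }
    }

    private void persist(long timestampMillis, boolean lightIsRed) {
        System.out.printf("%d light=%s%n", timestampMillis, lightIsRed ? "red" : "green");
    }
}
```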
Tobias Macey
0:48:39
in terms of your overall experience, working as the CTO of this organization and shepherding the product direction and the capabilities of this system, I'm wondering what you have found to be some of the most challenging aspects, both from the technical and business sides, and some of the most useful or interesting or unexpected lessons that you've learned in the process.
Simon Crosby
0:49:03
So what's hard is that the real world is not the cloud native world. We've all seen fabulous examples of Netflix and Amazon and everybody else doing cool things with data — they do. But you know, if you're an oil company and you have a rig at sea, you just don't know how to do this. So, you know, we can come at this with whatever skill sets we have; what we find is that the real world large enterprises of today are still ages behind the cloud native folks. And that's a challenge. Okay, so getting to be able to understand what they need — because they still have lots of assets which are generating tons of data — is very hard. Second, this notion of edge is continually confusing. And I mentioned previously that I would never have chosen 'IoT Edge', for example, as the Azure name, because it's not about IoT — or maybe it is — but let me give you two examples. One is traffic lights, say, physical things; it's pretty obvious there what the notion of edge is — it's the physical edge. But the other one is this: we build a real time model, in memory, for tens of millions of handsets for a large mobile carrier, and they evolve all the time, right, in response to continually received signals from these devices,
0:50:38
and there is no edge;
0:50:40
that is, it's data that arrives over the internet. And we have to figure out where the digital twin for that thing is, and evolve it in real time. Okay, and there, you know, there is no concept of a network, you know, or physical edge, and traveling over them. We just have to make decisions on the fly and learn and update the model.
0:51:06
So for me, edge is the following thing: edge is stateful.
0:51:13
And
0:51:15
cloud is all about REST. Okay, so what I'd say is, the fundamental difference between the notion of edge and the notion of cloud that I would like to see broadly understood is that, whereas REST and databases made the cloud very successful, in order to be successful with, you know, this boundless streaming data, statefulness is fundamental — which means REST goes out the door, and we have to move to a model which is streaming based, with stateful computation.
Tobias Macey
0:51:50
And then in terms of the future direction, both from the technical and business perspective, I'm wondering what you have planned for both the enterprise product for Swim.ai, as well as the open source kernel in the form of SwimOS.
Simon Crosby
0:52:06
From an open source perspective, we,
0:52:08
you know, we don't have the advantage of having come up at LinkedIn or something, where it was built and used in-house at scale before coming out; we're coming out of a startup. But we think what we've built is something of phenomenal value, and we're seeing that grow. And our intention is to continually feed the community as much as it can take, and we're just getting more and more stuff ready for open sourcing.
0:52:36
So we want to see our community
0:52:40
go and explore new use cases for this stuff, and we are totally dedicated to empowering our community. From a commercial perspective, we are focused on our world, which is edge — and the moment you say that, people tend to get an idea of the physical edge or something in their heads, and then, you know, very quickly you can get put in a bucket of IoT. I gave an example of, say, building a model in real time in AWS for, you know, a mobile customer. Our intention is to continue to push the bounds of what edge means, and to enable people to build stream pipelines for massive amounts of data easily, without complexity, and without the skill set required to invest in these traditionally fairly heavyweight pipeline components such as Beam and Flink and so on,
0:53:46
to
0:53:47
to enable people to get insights cheaply, and to make the problem of dealing
0:53:51
with new insights from data very easy to solve.
Tobias Macey
0:53:56
And are there any other aspects of your work on Swim, or the space of streaming data and digital twins, that we didn't discuss yet that you'd like to cover before we close out the show?
Simon Crosby
0:54:08
I think we've done a pretty good job. You know, I think there are a bunch of parallel efforts, and that's all goodness. One of the hardest things has been to get this notion of statefulness more broadly accepted, and I see the function vendors out there pushing their idea of stateful functions as a service — and really, these are stateful actors. And there are others out there too. So for me, step number one is to get people to realize that if we're going to take in this data, REST and databases are going to kill us, okay? That is, there is so much data and the rates are so high that you simply cannot afford to use a stateless paradigm for processing; you have to do it statefully. Because, you know, forgetting context every time and then looking it up is just too expensive.
Tobias Macey
0:55:08
For anybody who wants to follow along with you, get in touch, and keep track of what you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Simon Crosby
0:55:26
Well, I think — I mean, there isn't much tooling, to be perfectly honest. There are a bunch of really fabulous open source code bases, and experts in their use, but that's far from tooling. And then there is, I guess, an extension of Power BI downwards, which is something like the monster Excel spreadsheet world, right? So you find all these folks who are pushing that kind of, you know, end user model of data, doing great things, but leaving a huge gap between the consumer of the insight and the data itself — it assumes the data is already there in some good form and can be put into a spreadsheet or view or whatever it happens to be. So there's this huge gap in the middle, which is: how do we build the model? What does the model tell us, just off the bat? How do we do this constructively in large numbers of situations? And then how do we dynamically insert operators which are going to compute useful things for us on the fly in running models?
Tobias Macey
0:56:44
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on the swim platform. It's definitely a very interesting approach to data management and analytics, and I look forward to seeing the direction that you take it in the future. So I appreciate your time on that. I hope you enjoy the rest of your day.
Simon Crosby
0:57:01
Thanks very much. It's been great.
Tobias Macey
0:57:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it — email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building A Reliable And Performant Router For Observability Data - Episode 97

Summary

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what the Vector project is and your reason for creating it?
    • What are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?
  • What strategy are you using for project governance and sustainability?
  • What are the main use cases that Vector enables?
  • Can you explain how Vector is implemented and how the system design has evolved since you began working on it?
    • How did your experience building the business and products for Timber influence and inform your work on Vector?
    • When you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?
    • What led you to choose Lua as the embedded scripting environment?
  • What data format does Vector use internally?
    • Is there any support for defining and enforcing schemas?
      • In the event of a malformed message is there any capacity for a dead letter queue?
  • What are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?
  • When designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?
  • What options are available to operators to support visibility into the running system?
  • In terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?
  • What are some of the other considerations that operators and administrators of Vector should be considering?
  • You have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?
  • What is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)
  • What are some of the most interesting/innovative/unexpected ways that you have seen Vector used?
  • What are some of the challenges that you have faced in building/publicizing Vector?
  • For someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?
    • What is missing that you would consider necessary for production readiness?
  • When is Vector the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router. So Ben, can you start by introducing yourself?
Ben Johnson
0:01:47
Sure. My name is Ben. I am the co-founder and CTO of Timber.io.
Tobias Macey
0:01:53
And Luke, how about yourself?
Luke Steensen
0:01:54
Yeah. I'm Luke Steensen. I'm an engineer at Timber.
Tobias Macey
0:01:58
And Ben, going back to you. Do you remember how you first got involved in the area of data management?
Ben Johnson
0:02:02
Yeah. So I mean, just being an engineer, obviously, you get involved in it through observing your systems. And so before we started Timber, I was an engineer at SeatGeek, and we dealt with all kinds of observability challenges there.
Tobias Macey
0:02:16
And Luke, do you remember how you first got involved in the area of data management?
Luke Steensen
0:02:20
Yeah, so at my last job, I ended up working with Kafka quite a bit in a few different contexts, so I ended up getting pretty involved with that project, leading some of our internal stream processing projects that we were trying to get off the ground, and I just found it a very interesting space. The more that you dig into a lot of different engineering problems, they end up boiling down to managing your data, especially when you have a lot of it. It kind of becomes the primary challenge and limits a lot of what you're able to do. So the more tools and techniques you have to address those issues and design around them, the further you can get, I think.
Tobias Macey
0:03:09
And so you both work at Timber.io, and you have begun work on this Vector project. So I'm wondering if you can explain a bit about what it is and the overall reason that you had for creating it in the first place?
Ben Johnson
0:03:21
Yeah, sure. So at the most basic level, Vector is an observability data router. It collects data from anywhere in your infrastructure, whether that be a log file, a TCP socket, or StatsD metrics, and it is designed to ingest that data and then send it to multiple sinks. The idea is that it is vendor agnostic, collects data from many sources, and sends it to many destinations. And the reason we created it was really for a number of reasons. One, being an observability company, when we initially launched Timber it was a hosted logging platform, and we needed a way to collect our customers' data. We tried writing our own initially, in Go, that was very specific to our platform, and that was difficult. We started using off-the-shelf solutions and found those to also be difficult; we were getting a lot of support requests, and it was hard for us to contribute to and debug them. And then in general, our ethos as a company is that we want to create a world where developers have choice, aren't locked into specific technologies, and are able to move with the times and choose best-in-class tools for the job. That's what prompted us to start Vector. That vision, I think, is enabled by an open collector that is vendor agnostic and meets a quality standard that makes people want to use it. It looks like we have other areas in this podcast where we'll get into some of the details there, but we really wanted to raise the bar on open collectors and start to give control and ownership back to the people, the developers, that are deploying Vector.
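As a rough picture of that source-to-sink model, a minimal configuration might look something like the sketch below. The component types are ones discussed in the episode, but the option names follow early Vector releases as best as can be recalled, and the file path is made up, so treat this as an illustration rather than a canonical example.

    [sources.app_logs]
      type = "file"                      # tail a log file
      include = ["/var/log/app/*.log"]   # hypothetical path

    [sinks.console_out]
      type = "console"                   # write events to stdout
      inputs = ["app_logs"]              # wire the sink to the source by ID
      encoding = "text"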
Tobias Macey
0:05:14
And as you mentioned, there are a number of other off-the-shelf solutions that are available. Personally, I've had a lot of experience with Fluentd, and I know that there are other systems coming from the Elastic Stack and other areas. I'm curious, what are some of the tools that you consider to be comparable or operating in the same space, and any of the ones that you've had experience with that you found to be lacking? And what were the particular areas that you felt needed to be addressed that weren't being handled by those other solutions?
Ben Johnson
0:05:45
Yeah, I think that's a really good question. So first, I would probably classify the collectors as either open or not. We're not too concerned with vendor-specific collectors, like the Splunk forwarder or any other sort of thing that just ships data to one service. So in the category of comparing tools, I'll focus on the ones that are open, like you said: Fluentd, Filebeat, LogStash. I think it's questionable that they're completely open, but I think we're more comparable to those tools. And I'd also say that I don't want to say anything negative about those projects, because a lot of them were pieces of inspiration for us, and we respect the fact that they are open and that they were solving a problem at the time. But one of the projects that we thought was a great alternative and that inspired us is one called Cernan. It was built by Postmates, and it's also written in Rust. That kind of opened our eyes a little bit to a new bar, a new standard, that you could set with these collectors. We actually know Brian Troutwine, one of the developers that worked on it; he's been really friendly and helpful to us. But the reason we didn't use Cernan is, one, it was created out of necessity at Postmates, and it doesn't seem to be actively maintained. That's one of the big reasons we started Vector, and so I would say that's something that's lacking: a project that is actively maintained and is in it for the long term. Obviously, that's important. And then in terms of the actual specifics of these projects, there's a lot that I could say here. On one hand, we've seen a trend of certain tools that are built for a very specific storage and then sort of backed into supporting more sinks. It seems like the incentives and fundamental practices of those tools are not aligned with many disparate storages that ingest data differently; for example, the fundamentals of batching versus stream processing, those two ways of collecting data and sending it downstream, don't work for every single storage that you want to support. The other thing is just the obvious ones like performance, reliability, and having no dependencies. If you're not a strong Java shop, you probably aren't comfortable deploying something like LogStash and then managing the JVM and everything associated with that. And I think another thing is that we wanted a collector that was fully vendor agnostic and vendor neutral, and a lot of them don't necessarily fall into that bucket. As I said before, that's something we really strongly believe in: an observability world where developers can rely on a best-in-class tool that is not biased and has aligned incentives with the people using it, because there are just so many benefits that stem from that.
Tobias Macey
0:08:51
And on the point of sustainability and openness, I'm curious, since you are part of a company and this is in some ways related to the product offering that you have, how you're approaching issues such as project governance and sustainability, and ensuring that the overall direction of the project remains impartial and open. How are you trying to foster a community around it so that it's not entirely reliant on the direction that you take it internally, and so that you're incorporating input from other people who are trying to use it for their specific use cases?
Ben Johnson
0:09:28
Yeah, I think that's a great question.
0:09:31
So one is we want to be totally transparent on everything. Everything we do with Vector, discussions, specifications, roadmap planning, it's all available on GitHub, and nothing is private there. We want Vector to truly be an open project that anyone can contribute to. And then, in terms of governance and sustainability, we try to do a really good job just maintaining the project. Number one is good issue management, making sure that that's done properly; it helps people search for issues and understand which issues need help and what are good first issues to start contributing on. We wrote an entire contribution guide and put some serious thought into it so that people understand the principles of Vector and how to get started. And then I think the other thing that really sets Vector apart is the documentation, and I think that's actually very important for sustainability. It's really a reflection of your project's respect for the users, in a way, but it also serves as a really good opportunity to explain the project and help people understand the internals of the project and how to contribute to it. So it really all comes together, but I'd say the number one thing is just transparency and making sure everything we do is out in the open.
Tobias Macey
0:11:00
And then in terms of the use cases that Vector enables, obviously one of them is just being able to process logs from a source to a destination. But in the broader sense, what are some of the ways that Vector is being used, both at Timber and with other users and organizations that have started experimenting with it, beyond just the simple case?
Ben Johnson
0:11:20
So first, Vector's new, so we're still learning a lot as we go. But the core business use cases we see range from reducing cost — Vector is quite a bit more efficient than most collectors out there, so just by deploying Vector you're going to be using fewer CPU cycles and less memory, and you'll have more of that available for the app that's running on that server. The other side of that is the fact that Vector enables choosing multiple storages, and the storage that is best for your use case, which lets you reduce costs as well. So for example, if you're running an ELK stack, you don't necessarily want to use it for archiving; you can use another cheaper, durable storage for that purpose and take that responsibility out of your ELK stack, and that reduces costs in that way. So I think that's an interesting way to use Vector. Another one is, like I said before, reducing lock-in. That use case is so powerful because it gives you agility, choice, control, and ownership of your data. Transitioning vendors is a big use case; so many companies we talk to are bogged down and locked into vendors, and they're tired of paying the bill, but they don't see a clean way out. And observability is an interesting problem because it's not just technology coupling — there are human workflows that are coupled with the systems you're using. So transitioning out of something that maybe isn't working for your organization anymore requires a bridge, and Vector is a really great way to do that: deploy Vector, continue sending to whatever vendor you're using, and then at the same time start to try out other storages and setups without disrupting the human workflows in your organization. And I can keep going: there's data governance, where we've seen people cleaning up their data and enforcing schemas; security and compliance, where you have the ability to scrub sensitive data at the source before it even goes downstream. So again, having a good open tool like this is so incredibly powerful because of all of those use cases that it enables, and it lets you take advantage of them when you're ready.
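That bridge pattern, keeping the existing vendor fed while trying out cheaper or alternative destinations, falls directly out of the many-sinks model: two sinks simply read from the same source. A sketch is below, reusing the hypothetical app_logs file source from the earlier example; the sink types are real Vector components, but the endpoint, bucket, and option names are illustrative assumptions that may not match a given release.

    # Keep feeding the existing ELK stack so current human workflows keep working.
    [sinks.existing_elasticsearch]
      type = "elasticsearch"
      inputs = ["app_logs"]
      host = "http://elasticsearch.internal:9200"   # assumed vendor/ELK endpoint

    # At the same time, archive the raw events to cheaper object storage.
    [sinks.archive_s3]
      type = "aws_s3"
      inputs = ["app_logs"]
      bucket = "example-log-archive"                # hypothetical bucket
      region = "us-east-1"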
Tobias Macey
0:13:36
In terms of the actual implementation of the project, you've already mentioned in passing that it was written in Rust, and I'm wondering if you can dig into the overall system architecture and implementation of the project, and some of the ways that it has evolved since you first began working on it.
Luke Steensen
0:13:53
Like you said, Rust is, I mean, kind of the first thing everybody looks at with a certain interest.
0:13:57
And on top of that, we're building with the Tokio asynchronous I/O stack of libraries and tools within the Rust ecosystem. From the beginning we've kept Vector pretty simple architecturally; we have an eye on where we'd like to be, but we're trying to get there very incrementally. So at a high level, each of the internal components of Vector is generally either a source, a transform, or a sink. Probably familiar terms if you've dealt with this type of tool before, but sources are something that helps you ingest data, and transforms are different things you can do, like parsing JSON data into our internal data format, doing regular expression value extraction, like Ben mentioned, enforcing schemas, all kinds of stuff like that. And then sinks, obviously, which is where we actually forward that data downstream to some external storage system or service. Those are the high-level pieces. We have some different patterns around each of those, and obviously there are different flavors; if you have a UDP syslog source, that's going to look and operate a lot differently than a file-tailing source. So there are a lot of different styles of implementation, but we fit them all into those three buckets of source, transform, and sink. And then in the way that you configure Vector, you're basically building a data flow graph, where data comes in through a source, flows through any number of transforms, and then down the graph into a sink or multiple sinks. We try to keep it as flexible as possible, so you can pretty much build an arbitrary graph of data flow. Obviously, there are going to be situations where you could build something that's pathological or won't perform well, but we've leaned towards giving users the flexibility to do what they want. So if you want to parse something as JSON and then use a regex to extract something out of one of those fields, and then enforce a schema and drop some fields, you can chain all these things together. And you can have them fan out into different transforms and merge back together into a single sink, or feed two sinks from the same transform output, all that kind of stuff. So basically, we try to keep it very flexible. We definitely don't advertise ourselves as a general-purpose stream processor, but there's a lot of influence from working with those kinds of systems that has found its way into the design of Vector.
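To make the source/transform/sink graph concrete, here is a sketch of a configuration that chains two transforms and fans the result out to two sinks, wiring each stage to the previous one by ID. The component types existed in early Vector releases, but the specific option names, regex, and endpoints here are assumptions for illustration only.

    [sources.syslog_in]
      type = "syslog"
      mode = "udp"
      address = "0.0.0.0:514"

    [transforms.parse_json]
      type = "json_parser"
      inputs = ["syslog_in"]            # draw from the source by ID

    [transforms.extract_status]
      type = "regex_parser"
      inputs = ["parse_json"]           # chain transforms explicitly
      regex = "status=(?P<status>\\d+)" # hypothetical pattern

    # Fan out: the same transformed stream feeds two different sinks.
    [sinks.es_out]
      type = "elasticsearch"
      inputs = ["extract_status"]
      host = "http://localhost:9200"    # assumed endpoint

    [sinks.debug_console]
      type = "console"
      inputs = ["extract_status"]
      encoding = "json"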
Tobias Macey
0:17:09
Yeah, the ability to map together different components of the overall flow is definitely useful. I've been using Fluentd for a while, which has some measure of that capability, but it's also somewhat constrained in that the logic of the actual pipeline flow depends on the order of specification in the configuration document, which makes it sometimes difficult to understand exactly how to structure the document so that everything functions properly. There are some mechanisms for being able to route things slightly out of band with particular syntax, but just managing it has gotten to be somewhat complex. So when I was looking through the documentation for Vector, I appreciated the option of being able to simply say that the input to one of the steps is linked to the ID of one of the previous steps, so that you're not necessarily constrained by the order of definition and can instead just use the ID references to define the flows.
Luke Steensen
0:18:12
Yeah, that was definitely something that we spent a lot of time thinking about, and we still spend a lot of time thinking about, because if you squint at these config files, they're kind of like a program that you're writing: you have data inputs and processing steps and data outputs. So you want to make sure that that flow is clear to people, and you also want to make sure that there aren't going to be any surprises. A lot of tools, like you mentioned, have this as kind of an implicit part of the way the config is written, which can be difficult to manage; we wanted to make it as explicit as possible, but also in a way that is still relatively readable when you open up the config file. We've gone with a pretty simple TOML format, and then, like you mentioned, you just specify which input each component should draw from. We have had some ideas and discussions about what our own configuration file format might look like. What we would love to do eventually is make these kinds of pipelines as pleasant to write as something like a bash pipeline, which is another really powerful inspiration for us. Obviously they have their limitations, but the things that you can do just in a bash pipeline — you have a log file, you grep things out, you run it through something else — there's all kinds of really cool stuff that you can do in a lightweight way. That's something that we've put a little thought into: how can we be as close to that level of power and flexibility while avoiding a lot of the limitations of obviously being a single tool on a single machine? And I don't want to get into all the gotchas that come along with writing bash one-liners, obviously there are a lot, but it's something that we want to take as much of the good parts from as possible.
Tobias Macey
0:20:33
And then in terms of your decision process for the actual runtime implementation, for both the engine itself as well as the scripting layer that you implemented in Lua, what was the decision process that went into choosing and settling on Rust? And what were the overall considerations and requirements that you had as part of that decision process?
Luke Steensen
0:20:57
So from a high level, the things that we thought were most important when writing this tool, which is obviously going to run on other people's machines, and hopefully a lot of other people's machines — we want to be respectful of the fact that they're willing to put our tool on a lot of their boxes. So we don't want to use a lot of memory, we don't want to use a lot of CPU, we want to be as resource constrained as possible. Efficiency was a big point for us, and Rust gives you the ability to do that. I'm a big fan of Rust, so I could probably talk for a long time about all the wonderful features, but honestly, the fact that it's a tool that lets you write very efficient programs and control your memory use pretty tightly is somewhere that we have a pretty big advantage over a lot of other tools. And then I was the first engineer on the project, and I know Rust quite well, so the human aspect of it also made sense for us. We're lucky to have a couple of people at Timber who are very good with Rust, very familiar with and involved in the community, so it has worked out very well. From the embedded scripting perspective, Lua was an obvious first choice for us. There's very good precedent for using Lua in this manner, for example in NGINX and HAProxy; they both have Lua environments that let you do a lot of amazing things that you would maybe never expect to be able to do with those tools — you write a little bit of Lua, and there you go, you're all set. So Lua is very much built for this purpose, it's built as an embedded language, and there's a mature implementation of bindings for Rust, so it didn't take a lot of work to integrate Lua, and we have a lot of confidence that it's a reasonably performant, reliable thing that we can drop in and expect to work. That being said, it's definitely not the end-all be-all. While people can be familiar with Lua from a lot of different areas where it's used, like game development and, like I mentioned, some observability and infrastructure tools, we are interested in supporting more than just Lua. We actually have a work-in-progress JavaScript transform that will allow people to use that as an alternative engine for transformations. And then a little bit longer term — we want this area to mature a little bit before we dive in — the WebAssembly space has been super interesting, and I think that, particularly from a flexibility and performance perspective, it could give us a platform to do some really interesting things in the future.
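For a sense of what that embedded scripting looks like in practice, here is a sketch of a Lua transform that scrubs a field before events go downstream. It assumes the transform exposes each event as a mutable Lua table named event; the field names are made up, and the exact scripting API may differ between Vector versions.

    [transforms.scrub_pii]
      type = "lua"
      inputs = ["parse_json"]           # assumes a transform named parse_json upstream
      source = """
      -- redact a hypothetical email field if present
      if event["email"] ~= nil then
        event["email"] = "[REDACTED]"
      end
      -- tag the event so downstream systems know it was processed
      event["scrubbed"] = "true"
      """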
Tobias Macey
0:24:06
Yeah, I definitely think that the WebAssembly area is an interesting space to keep an eye on, because it is in some ways being targeted as a sort of universal runtime that multiple different languages can target. And in terms of your choice of Rust, another benefit that it has, when you're discussing memory efficiency, is the guaranteed memory safety, which is certainly important when you're running in customer environments, because that way you're less likely to have memory leaks or to accidentally crash their servers because of a bug in your implementation. So I definitely think that that's a good choice as well. And one other aspect of the choice of Rust for the implementation language that I'm curious about is how that has impacted the overall uptake of users who are looking to contribute to the project, either because they're interested in learning Rust, or in terms of people who aren't necessarily familiar with Rust and any barriers that may pose.
Luke Steensen
0:25:02
It's hard to know, because obviously we can't inspect the alternate timeline where we wrote it in Go or something like that. I would say that there are ups and downs from a lot of different perspectives. From a developer-interest perspective, I think Rust is something that a lot of people find interesting; the sales pitch is a good one, and a lot of people find it compelling. So it's definitely caught a few more people's interest because it happens to be written in Rust. We try not to push on that too hard, because of course there's the other set of people who do not like Rust and are very tired of hearing about it. So we love it, and we're very happy with it, but we try not to make it a primary marketing point or anything like that. But I think it does help to some degree. And then from a contributions perspective, again, it's hard to say for sure, but I do know from experience that we have had a handful of people pop up from the open source community and give us some really high quality contributions, and we've been really happy with that. Like I said, we can't really know how that would compare to if we had written it in a language that more people are proficient in, but the contributions from the community that we have seen so far have been really high quality. The JavaScript transform that I mentioned is actually a good example of that: we had a contributor come in and do a ton of really great work to make it a reality, and it's something that we're pretty close to being able to merge and ship. So I definitely shared a little bit of that concern — I know Rust at least has the reputation of being a more difficult language to learn — but the community is there, there are a lot of really skilled developers who are interested in Rust and would love to have an open source project like Vector that they can contribute to, and we've definitely seen a lot of benefit from that.
Tobias Macey
0:27:12
In terms of the internals of Vector, I'm curious how the data itself is represented once it has been ingested into the system, and how you process it through the transforms — whether there's a particular data format that you use internally in memory, and also any capability for schema enforcement as the data flows through Vector out to the sinks.
Luke Steensen
0:27:39
Yeah, so right now we have our own internal, in-memory data format. It's a little bit — I don't want to say thrown together, but it's something that's been incrementally evolving pretty rapidly as we build up the number of different sources and things that we support. This was actually something that we deliberately set out to do when we were building Vector: we didn't want to start with the data model. There are some projects that do that, and I think there's definitely a space for that, but the data modeling in the observability space is not always the best, and we explicitly wanted to leave that to other people. We were going to start with the simplest possible thing and then add features as we found that we needed them, in order to better support the data models of the downstream sinks and the transforms that we want to be able to do. So from day one, the data model was basically just a string: you send us a log message, and we represent it as a string of characters. Obviously, it's grown a lot since then. We now support what we call events internally; it's our vague name for everything that flows through the system. Events can be a log or a metric. For metrics, we support a number of different types, including counters and gauges, all your standard types of metrics from the StatsD family of tools. And logs can be just a message, like I said, just a string — we still support that as much as we ever have — but we also support more structured data. So right now it's a flat map of string to something, and we have a variety of different types that the values can be; that's also something that's growing as we aim to better support different tools. So right now it's a non-nested, JSON-ish representation. In memory, we don't actually serialize it to JSON, and we support a few extra types, like timestamps, that are important for our use case, but in general that's how you can think about it. We have a Protocol Buffers schema for that data format that we use when we serialize to disk for some of our on-disk buffering, but I wouldn't say that's the primary representation; when you work with it in a transform, you're looking at that in-memory representation that, like I said, looks a lot like JSON. And that's something that we're constantly reevaluating and thinking about how we want to evolve. I think the next step in that evolution is to make it not just a flattened map and move it towards supporting nested maps and nested keys, so it's going to move more towards an actual full JSON-like structure, with better types and support for things like that.
Tobias Macey
0:30:39
And on the reliability front, you mentioned briefly the idea of disk buffering, and that's definitely something that is necessary for the case where you need to restart the service and you don't want to lose messages that have been sent to an aggregator node, for instance. I'm curious, what are some of the overall capabilities in Vector that support this reliability objective? And also, in terms of things such as malformed messages, if you're trying to enforce a schema, is there any way of putting those into a dead letter queue for reprocessing, or anything along those lines?
Luke Steensen
0:31:14
Yeah, a dead letter queue specifically isn't something that we support at the moment. It is something that we've been thinking about, and we do want to come up with a good way to support it, but currently that isn't something that we have. A lot of transforms, like the schema enforcement transform, will end up just dropping the events that don't match; if it can't enforce that they meet the schema by dropping fields, for example, it will just drop the event, and we recognize the shortcomings there. I think one of the things that is a little bit nice, from an implementation perspective, about working in the observability space, as opposed to the normal data streaming world with application data, is that there's more of an expectation of best effort, which is something that we're willing to take advantage of a little bit in the very early stages of a certain feature or tool. But it's definitely a part of the ecosystem that we want to push forward, so that's something that we try to keep in mind as we build all this: yes, it might be okay now, and we may have parity with other tools if we just drop messages that don't meet a certain schema, but how can we do better than that? There are other kinds of things in the toolbox that you can reach for for this type of problem. The most basic one would be that you can send data to multiple destinations, so if you have a classic syslog-like setup where you're forwarding logs around, it's super simple to just add a secondary output that forwards to both syslog aggregator A and syslog aggregator B. That's nothing particularly groundbreaking, but it's a start. Beyond that, I mentioned the disk buffer, where we want to do as good a job as we can ensuring that we don't lose your data once you have sent it to us. We are still a single-node tool at this point — we're not a distributed storage system — so there are going to be some inherent limitations in the guarantees that we can provide there. If you really want to make sure that you're not losing any data at all, Vector is not going to be able to give you the guarantees that something like Kafka would, so we want to make sure that we work well with tools like Kafka that are going to give you pretty solid, redundant, reliable, distributed storage guarantees. Other than those two, writing the tool in Rust is an indirect way that we try to make it as reliable as possible. Rust has a little bit of a reputation for making it tricky to do things — the compiler is very picky and wants to make sure that everything you're doing is safe — and that's something you can take advantage of to guide you in writing, as you mentioned, memory-safe code. But it expands beyond that into ensuring that every error that pops up is something you're handling explicitly at that level or a level above, and things like that. It guides you into writing more reliable code by default. Obviously, it's still on you to make sure that you're covering all the cases, but it definitely helps.
And then moving forward, we're going to spend a lot of time in the very near future setting up internal torture environments, if you will, where we can run Vector for long periods of time and induce certain failures in the network and the upstream services, maybe delete some data from underneath it on disk, that kind of thing. If you're familiar with the Jepsen suite of database testing tools — obviously, we don't have quite the same types of invariants that an actual database would have, but I think we can use a lot of those techniques to stress Vector and see how it responds. And like I said, we're going to be limited in what we can do based on the fact that we're a single-node system; if you're sending us data over UDP, there's not a ton of guarantees that we're going to be able to give you. But to the degree that we're able to give guarantees, we very much would like to do that, so that's definitely a focus of ours, and we're going to be exploring it as much as possible.
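The disk buffering Luke mentions is configured per sink, while redundant delivery (the syslog A-and-B case) is simply two sinks declared with the same inputs, as in the earlier fan-out sketch. The following is a rough sketch under the assumption that buffer options looked like this in early releases; the option names, size, and endpoint are illustrative and may differ by version.

    [sinks.es_out]
      type = "elasticsearch"
      inputs = ["app_logs"]                 # assumes a source named app_logs
      host = "http://elasticsearch.internal:9200"

      # Spill in-flight events to disk so they survive a process restart;
      # "block" applies backpressure when the buffer fills rather than dropping events.
      [sinks.es_out.buffer]
        type = "disk"
        max_size = 104900000                # bytes, illustrative value
        when_full = "block"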
Tobias Macey
0:36:03
And then, in terms of the deployment topologies that are available, you mentioned one situation where you're forwarding to a Kafka topic, but I'm curious what other options there are for ensuring high availability and the overall uptime of the system for delivering messages or events or data from the source to the various destinations.
Luke Steensen
0:36:28
Yeah, there are a few different general topology patterns that we've documented and that we recommend to people. The simplest one, depending on how your infrastructure is set up, can just be to run Vector on each of your application servers, or whatever it is that you have, run them there in a very distributed manner, and forward to some upstream logging service or something like that. You can do that without any aggregation happening in your infrastructure; that's pretty easy to get started with, but it does have limitations, if you don't want to allow outbound internet access from each of your nodes, for example. The next step, like you mentioned, is that we support a second tier of Vector, running maybe on a dedicated box, where you would have a number of nodes forward to this more centralized aggregator node, and then that node would forward to whatever other sinks you have in mind. That's the second level of complexity, I would say. You do get some benefits in that you have more power to do things like aggregations and sampling in a centralized manner; there are going to be certain things that you can't do if you never bring the data together. And especially if you're looking to reduce cost, it's nice to be able to have that aggregator node as a leverage point where you can bring everything together, evaluate what is most important for you to forward to different places, and do that there. And then, for people who want more reliability than a single aggregation node, at this point our recommendation is something like Kafka, which is going to give you distributed, durable storage. That is a big jump in terms of infrastructure complexity, so there's definitely room for some in-betweens there that we're working on in terms of having a failover option. Right now, you could put a couple of aggregator nodes behind a TCP load balancer or something like that, but that's not necessarily going to be the best experience. So we're investigating our options there to try to give people a good range of choices for how much they're willing to invest in the infrastructure, and what kind of reliability and robustness benefits they need.
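A sketch of the agent-plus-aggregator tier Luke describes might look like the following, assuming the vector source and sink (Vector's own forwarding protocol) are available in the release being run, with Kafka as the durable downstream. The addresses, topic name, and option names are assumptions for illustration.

    # --- agent node (runs on each application server) ---
    [sources.app_logs]
      type = "file"
      include = ["/var/log/app/*.log"]

    [sinks.to_aggregator]
      type = "vector"                         # forward to the aggregator tier
      inputs = ["app_logs"]
      address = "aggregator.internal:9000"    # assumed address

    # --- aggregator node (dedicated box) ---
    [sources.from_agents]
      type = "vector"
      address = "0.0.0.0:9000"

    [sinks.to_kafka]
      type = "kafka"
      inputs = ["from_agents"]
      bootstrap_servers = "kafka-1.internal:9092"
      topic = "observability-events"          # hypothetical topic
      encoding = "json"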
Tobias Macey
0:39:19
Another aspect of the operational characteristics of the system is being able to have visibility, particularly at the aggregator level, into the current status of the buffering, any errors that are cropping up, and the overall system capacity. I'm curious if there's any current capability for that, or what the future plans are along those lines.
Luke Steensen
0:39:44
Yeah, we have a setup for our own internal metrics at this point. That's another thing, alongside the reliability work that you mentioned, that we're looking at very closely right now and thinking about what we want to do next. The way we've set ourselves up, we have an event-based system internally, where events can be emitted normally as log events, but we also have the means to essentially send them through something like a Vector pipeline, where we can do aggregations and filter and sample and do that kind of thing to get better insight into what's happening in the process. We haven't gotten as far as I'd like down that road at this point, but I think we have a pretty solid foundation to do some interesting things, and it's definitely going to be a point of focus in the next few weeks.
Tobias Macey
0:40:50
So in terms of the overall roadmap, you've got a fairly detailed set of features and capabilities that you're looking to implement, and I'm wondering what your decision process was in terms of the priority ordering of those features, and how you identified what the necessary set was for a 1.0 release.
Ben Johnson
0:41:12
So initially, when we planned out the project, our roadmap was largely influenced by our past experiences, not only supporting Timber customers but also building and running our own observability tools. And just based on the previous questions you asked, it was obvious to us that we would need to support those different types of deployment models, and so a lot of the roadmap was dictated by that. You can see, later on the roadmap, that we want to support stream processors so we can enable that sort of deployment topology. It's very much evolving, though; as we learn and collect data from customers and their use cases, we're actually going to make some changes to it. In terms of a 1.0 release, everything that you see on the roadmap on GitHub is represented as milestones, and a 1.0 release for us represents something that a reasonably sized company could deploy and rely on Vector for. Again, given our experience, a lot of that is dependent on Kafka, or some sort of more complex topology, as it relates to collecting your data and routing it downstream.
Tobias Macey
0:42:38
And then, in terms of the current state of the system, how would you characterize its overall production readiness, and are there any features that are currently missing that you think would be necessary for a medium to large scale company to be able to adopt it readily?
Ben Johnson
0:42:55
Yeah. So in terms of a 1.0 release, where we would recommend it for very stringent production use cases, I think what Luke just talked about with internal metrics is really important: that we improve Vector's own internal observability and provide operators the tools necessary to monitor performance, set up alarms, and make sure that they have confidence in Vector. Internally, the stress testing is also something that would raise our confidence; we have a lot of interesting stress testing scenarios that we want to run Vector through, and I think that'll expose some problems, but getting that done would definitely raise our confidence. And then I think there's just some general housekeeping that would be helpful for a 1.0 release. The initial stages of this project have been inherently a little more messy, because we are building out the foundation and moving pretty quickly with our integrations. I would like to see that settle down when we get to 1.0, so that we have smaller incremental releases, and we take breaking changes incredibly seriously — Vector's reliability and least-surprise philosophy definitely plays into how we're releasing the software and making sure that we aren't releasing a minor update that actually has breaking changes in it, for example. So I would say those are the main things missing before we can officially call it 1.0. Outside of that, the one other thing that we want to do is provide more education on some high-level use cases around Vector. I think right now the documentation is very good in that it dives deep into all the different components, the sources, sinks, and transforms, and all the options available, but I think we're lacking in guidance around how you deploy Vector in an AWS environment or a GCP environment. That's certainly not needed for 1.0, but I think it is one of the big missing pieces that will make Vector more of a joy to use.
Tobias Macey
0:44:55
In terms of the integrations, what are some of the ways that people can add new capabilities to the system? Does it require being compiled into the static binary, or are there other integration points where somebody can add a plugin? And then also, in terms of just using the system, I'm curious what options there are for being able to test out a configuration to make sure that the event flow is what you're expecting.
Luke Steensen
0:45:21
So in terms of plugins, basically we don't have a strong concept of that right now. All of the transforms that I've mentioned, and the sources and sinks, are written in Rust and natively compiled into the system. That has a lot of benefits, obviously, in terms of performance, and we get to make sure that everything fits together properly ahead of time, but it's not quite as extensible as we'd like at this point. So there are a number of strategies that we've thought about for allowing more user-provided plugins; I know that's a big feature of Fluentd that people get a lot of use out of, so it is something that we'd like to support. But we want to be careful how we do it, because we don't want to give up our core strengths, which I'd say are the performance and robustness and reliability of the system. We want to be careful how we expose those extension points, to make sure that the system as a whole maintains the properties that we find most valuable.
Ben Johnson
0:46:29
Yeah, and to echo Luke on that, we've seen that plugin ecosystems are incredibly valuable, but they can be very dangerous; they can ruin a project's reputation as it relates to reliability and performance. We've seen that firsthand with a lot of the different interesting Fluentd setups that we've seen with our customers: they'll use off-the-shelf plugins that aren't necessarily written properly or maintained actively, and it adds a variable to the discussion of running the collector that makes it very hard to ensure that it's meeting the reliability and performance standards that we want to meet. So I would say that if we do introduce a plugin system, it'll be quite a bit different than what people are expecting; that's something that we're putting a lot of thought into. And, to go back to some of the things you said before, we've had community contributions, and they're very high quality, but those still go through a code review process that exposes quite a few fundamental differences and issues in the code that would otherwise not have been caught. So it's an interesting conundrum to be in: on the one hand, we like that process because it ensures quality; on the other, it is a blocker in certain use cases.
Luke Steensen
0:47:48
Yeah, I think our strategy there so far has basically been to allow programmability in limited places, for example the Lua transform and the upcoming JavaScript transform; there's a surprising amount that you can do even when you're limited to the context of a single transformation. We are interested in extending that — would it make sense to have a sink or a source where you could write a lot of the logic in something like Lua or JavaScript or a language compiled to WebAssembly, and then we provide almost a standard library of I/O functions and things like that, which we would have more control over and could use to ensure, like Ben said, the performance and reliability of the system. And then the final thing is that we really want Vector to be as easy to contribute to as possible. Ben mentioned some housekeeping things that we want to do before we really consider it 1.0, and I think a lot of that is around extracting common patterns for things like sources, sinks, and transforms into kind of an internal library, so that if you want to come in and contribute support to Vector for a new downstream database or metrics service or something like that, that process is as simple as possible. We want you to be guided into the right path in terms of handling your errors and doing retries by default and all of that; we want it to be right there and very easy, so that we can minimize the barrier. There's always going to be a barrier if you say you have to write a pull request to get support for something, as opposed to just writing a plugin, but we want to minimize that as much as we possibly can.
Tobias Macey
0:49:38
And there are a whole bunch of other aspects of this project that we haven't touched on yet, that I have gone through in the documentation and that I think are interesting, but I'm wondering if there is anything in particular that either of you would like to discuss further before we start to close out the show.
Ben Johnson
0:49:55
In terms of the actual technical implementation of Vector, I think one of the unique things worth mentioning is that Vector intends to be the single data collector across all of your different types of data. We think that's a big gap in the industry right now: you need different tools for metrics and logs and exceptions and traces, and we think we can really simplify that. That's one of the things that we didn't touch on very well in this podcast, but right now we support logs and metrics, and we're considering expanding support for different types of observability data, so that you can claim full ownership and control of the collection and routing of that data.
Luke Steensen
0:50:36
Yeah, I mean, there are small technical things within Vector that I think are neat and that I could talk about for a while, but for me the most interesting part of the project is viewing it through the lens of being a platform that you program, something that's as flexible and programmable as possible, in the vein of those bash one-liners that I talked about. That's something that can be a lot of fun and can be very productive, and the challenge of lifting that thing that you do in the small, on your own computer or on a single server, up to a distributed context is a really interesting challenge to me. There are a lot of fun little pieces that you get to put together as you try to move in that direction.
Tobias Macey
0:51:28
Well, I'm definitely going to be keeping an eye on this project. And for anybody who wants to follow along with you, or get in touch with either of you and keep track of the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Luke Steensen
0:51:49
For me, I think there are so many interesting stream processing systems, databases, tools, and things like that, but there hasn't been quite as much attention paid to the glue: how do you get your data in, and how do you integrate these things together? That ends up being a big barrier for getting people into these tools and getting a lot of value out of them; there's a really high activation energy, and it's often assumed that you're already bought into a given ecosystem or something like that. That's the biggest thing for me: for a lot of people and a lot of companies, it takes a lot of engineering effort to get to the point where you can do interesting things with these tools.
Ben Johnson
0:52:33
And as an extension of that, it doesn't apply just to the collection side; it goes all the way to the analysis side as well. Our ethos is to help empower users to accomplish that and take ownership of their data and their observability strategy. Vector is the first project that we're launching in that initiative, but we think it goes all the way across. So, to echo Luke, that is the biggest thing, because so many people get so frustrated with it that they just throw their hands up and hand their money over to a vendor, which is fine in a lot of use cases, but it's not empowering. And there's no silver bullet — there's no one storage or one vendor that is going to do everything amazingly. At the end of the day, being able to take advantage of all the different technology and tools and combine them into a cohesive observability strategy, in a way that is flexible and able to evolve with the times, is the holy grail. That's what we want to enable, and we think that process needs quite a bit of improvement.
Tobias Macey
0:53:43
I appreciate the both of you taking the time today to join me and discuss the work that you're doing on Vector and Timber. It's definitely a very interesting project and one that I hope to be able to make use of soon to facilitate some of my overall data collection efforts. So I appreciate all of your time and effort on that, and I hope you enjoy the rest of your day.
Ben Johnson
0:54:03
Thank you. Yeah, and just to add to that, if anyone listening wants to get involved or ask questions, there's a community link on the repo itself where you can chat with us. We want to be really transparent and open, and we're always welcoming conversations around the things we're doing. Yeah, definitely.
Luke Steensen
0:54:22
Just want to emphasize everything Ben said. And thanks so much for having us.
Ben Johnson
0:54:26
Yeah, thank you.
Tobias Macey
0:54:32
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email [email protected] with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building A Community For Data Professionals at Data Council - Episode 96

Summary

Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was your original reason for focusing your efforts on fostering a community of data engineers?
    • What was the state of recognition in the industry for that role at the time that you began your efforts?
  • The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?
    • How has the community itself changed and grown over the past few years?
  • Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?
  • Where do you draw inspiration and direction for how to manage such a large and distributed community?
    • What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?
  • What are some ways that you have been surprised or delighted in your interactions with the data community?
  • How do you approach sustainability of the Data Council community and the organization itself?
  • The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?
  • In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?
  • You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?
    • What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?
  • What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?
    • What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?
  • What are the characteristics of a data business that you look at when evaluating a potential investment?
  • What are some of the current industry trends that you are most excited by?
    • What are some that you find concerning?
  • What are your goals and plans for the future of Data Council?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries, or are you just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies. And full disclosure that Data Council and Clubhouse are both previous sponsors of the podcast, and Clubhouse is one of the companies that Pete has invested in. So Pete, can you just start by introducing yourself?
Pete Soderling
0:02:44
Yeah, thanks. Thanks for the opportunity to be here, Tobias. I'm Pete Soderling, as you mentioned, and I'm a serial entrepreneur. I'm also a software engineer from the first internet bubble, and I'm super passionate about making the world a better place for other developers.
Tobias Macey
0:02:59
And do you remember how you first got involved in the area of data management?
Pete Soderling
0:03:02
Yeah, funnily enough, the thing that jumps out at me is how excited I was when I was an early developer, very young in my career, and I discovered the book Database Design for Mere Mortals. I think I read it over my holiday vacation one year, and I was sort of amazed at myself at how interesting such a potentially dry topic could be. So that was an early indicator. Fast forward: my second company, which I started in 2009, was a company called Strata Security. Originally we were building what we thought was an API firewall for web-based APIs, but it quickly morphed into a platform that secured and offered premium data from providers like Bloomberg or Garmin, or companies that had lots of interesting proprietary data sets. Our vision was to become essentially like the electricity meter in between the data provider and the consumers who were consuming that data through the API, so that we could offer basically metered billing based on how much data was consumed. So I guess that was my first significant interest as an entrepreneur in the data space, back about ten years or so ago.
Tobias Macey
0:04:20
And now you have become an investor in data oriented companies, and you've also started the series of conferences that were previously known as Data Eng Conf and have been rebranded as Data Council. That all started with your work in founding Hakka Labs as a community space for people working in the data engineering area. I'm curious what your original reason was for focusing your efforts in that direction, and focusing particularly on data engineers?
Pete Soderling
0:04:51
Yeah, I guess it gets to the core a bit of who I am. As I've looked back over my shoulder as both an engineer and a founder, what I've realized, which actually to some extent surprised me, is that all of the companies I've started have had software engineers as end users or customers. And I discovered that I really am significantly passionate about getting developers together, helping them share knowledge, helping them with better tooling, and essentially just making the world awesome for them. It's become a core mission of everything I do, and I think it's infused in all these different opportunities that I'm pursuing now. For instance, one of my goals is to help 1000 engineers start companies, but that gets to some of the startup stuff, which is essentially a side project that we can talk about later. Specifically as it relates to Hakka Labs: Hakka Labs was originally started in 2010 to become a community for software engineers. Originally we thought that we were solving the engineer recruiting problem, so we had various ideas and products that we tested around introducing engineers to companies in a trusted way. Originally that was largely for all software engineers everywhere, and our plan was to turn it into a digital platform. It was going to have social network dynamics where we were connecting engineers together, and those engineers would help each other find the best jobs. So it very much was in the social networking world. But one of our marketing ideas was that we wanted to build a series of events surrounding the digital platform, so that we could essentially lead users from the community and our events and introduce them to the digital platform, which was the main goal. And one of the niche areas that we wanted to focus on was data, because data science was becoming a hot thing and data engineering was even more nascent. But I sort of saw the writing on the wall and saw that data engineering was going to be required to support all the interesting data science goodness that was being talked about, and really was of interest to business. And so the data meetups that we started were essentially a marketing channel for this other startup that I was working on. Ultimately that startup didn't work and the product didn't succeed, which is often the case with network based products, especially in the digital world. But I realized that we had built this brand surrounding awesome content for software engineers and data engineers through the meetups that we had started. And we fell back on that community, and figured out that there must be another way to monetize and to keep the business going, because I was so passionate about working with engineers and continuing to build this community that we had seeded. Engineers loved the brand, they loved the events that we were running, and they loved our commitment to deeply technical content. So one thing led to another, and ultimately those meetups grew into what Data Council is today.
Tobias Macey
0:07:48
And you mentioned that when you first began on this journey it was in the 2010 timeframe, and, as you referred to, data engineering as a discipline was not very widely recognized. I'm wondering how you have seen the overall evolution of that role and responsibility, and what the state of the industry and the types of recognition and responsibilities were for data engineers at that time.
Pete Soderling
0:08:16
Yeah, you know, data engineering was just not really a thing at the time. Only the largest tech companies like Google and Facebook even had the notion of the data engineering concept. But I guess what I've learned from watching engineering at those companies is that, because of their scale, they discover things more quickly, more often, or earlier than other folks tend to. So I think that was just an interesting leading indicator, and I felt like it was going to get bigger and bigger. But yeah, I don't even know if Google necessarily had a data engineering job title at that time. So that was just very early in the space, and I think we've seen it develop a lot since then, and not just in the title. We saw early on that data science was a thing and was going to be a bigger thing, and data engineering was required to let the data scientists and the quants do awesome stuff in the first place. Then there are also the analysts who are trying to consume the data sets, oftentimes in slightly different ways than the data scientists. So early on we saw that these three types of roles were super fundamental and foundational to building the future of data infrastructure, business insights, and data driven products. Even though we started off running the data engineering meetup, which I think we're still known for, we pretty quickly, through the conference, embraced these other two disciplines as well, because the interplay of how these types of people work together inside organizations is where the really interesting stuff is. The themes in these job descriptions, how they unite, and how they work together on projects is fascinating. So through Data Council, our goal has been to further the industry by getting these folks in the same room around content that they all care about. Sometimes it's teaching a data engineer a little bit more about data science, because that's what makes them stronger and better able to operate on a multifunctional team, and sometimes it's the data scientists getting a little bit better at some of the engineering disciplines and understanding more of what the data engineers have to go through. So looking at this world in a cohesive way, especially across these three roles, has really benefited us and made the community and the event very unique and strong. And now, I think the next phase of that in terms of team organization, and especially in terms of our vision with Data Council, is that we're now embracing product managers into that group as well. There's the stack as we see it: data infrastructure on the bottom, then data engineering and pipelines, then data science and algorithms, then data analytics on top of that. And finally there are the AI features and the people who weaponize this entire stack into AI products and data driven features. I think the final icing on the cake, if you will, is creating space for data oriented product managers, because it used to be that a data product manager was maybe someone working for Oracle or in charge of shipping a database, but that's a bit older school at this point.
And there are all kinds of other interesting applications of data infrastructure and data technologies that are not directly in the database world, where real world product managers in the modern era need to understand how to interact with this stack and how to build data tooling, whether it's internal tooling for developers or customer and consumer facing. So I think embracing the product manager at the top of this stack has been super helpful for our community as well.
Tobias Macey
0:12:07
And I find it interesting that you are bringing in the product manager, because there has long been a division in particularly with technical disciplines where you have historically the software engineer who is at odds with the systems administrator, and then recently, the data scientist or data analyst who is at odds with the data engineer. But there has been an increasing alignment across the business case, and less so in terms of the technical divisions. And I'm curious what your view is in terms of how the product manager fits in that overall shift in perspective, and what their responsibility is within an organizational structure to help with the overall alignment in terms of requirements and vision and product goals between those different technical disciplines?
Pete Soderling
0:12:59
Yeah, well, I think this question is a microcosm of what's really happening in the engineering world, because software engineers, at the core, at the central location, are actually eating disciplines and roles that used to be sort of beneath them and above them. So again, I'm sticking to thinking in terms of this vertical stack. But most modern tech companies don't have DBAs, because the software engineer is now the DBA, and many companies don't have designated infrastructure teams, because a software engineer is responsible for their own infrastructure. Some of that is because of cloud and other dynamics, but what's happening at its core is that the engineer is eating the world. It's bigger than just software eating the world; engineers are eating the world. So I think the absorption of some of these older roles into what's now considered core software engineering has happened below, and I think it's happening above. Some product management is collapsing into the world of the software engineer, or engineers are sort of laddering up into product management, and I think part of that is the nature of these deeply technical products that we're building. So I think many engineers make awesome product managers. Perhaps they have to step away and be willing not to write as much code anymore, but because they have an intrinsic understanding of the way these systems are built, engineers are just sort of reaching out and absorbing a lot of these other roles. Some of the best product managers that we've seen have been ex software engineers. So I just think there's a real merging, and this is just a larger perception that I have of the world, of these roles into the software engineering related disciplines. And I think it's actually not a far leap to see how a product manager who's informed with an engineering discipline is super effective in that role. So I just think this is a broader story that we're seeing overall, if that makes sense.
Tobias Macey
0:15:02
Yeah, I definitely agree that there has been a much broader definition of what the responsibilities are for any given engineer, because there are a lot more abstractions for us to build off of, and so it empowers engineers to actually have a much greater impact with a similar amount of effort, because they don't have to worry about everything from the silicon up to the presentation layer. There are so many useful pre-built capabilities that they can take advantage of, so they can think more in terms of the systems rather than the individual bits and bytes. Yeah, exactly. And in terms of your overall efforts for community building and community management, there are a number of different focal points for communities to grow around that happen because of different shared interests or shared history. There are programming language communities, and there are communities focused on disciplines such as front end development, or product management, or business. I'm wondering what your experience has been in terms of how to orient a community focus along the axis of data, given that it can in some ways be fairly nebulous as to what the common principles in data are, because there are so many different ways to come at it and so many different formats that data can take.
Pete Soderling
0:16:31
Yeah, I think one of the core values for us, and I don't know if this is necessarily specific to data or not, is just openness. We see ourselves as much, much more than just a conference series, and we use the word community, in our team and at our events, to describe what we're doing dozens and dozens of times a week. So I think the community bond and mentality for our brand is super high. There's also an open source sort of commitment, and I think that's a mentality, a coding discipline, and a style. Sharing knowledge is just super important in any of these technical fields, and engineers are super thirsty for knowledge. We see ourselves as being a connecting point where engineers and data scientists can come and share knowledge with each other. Maybe that's a little bit accelerated in the case of data science or AI research, because these things are changing so fast, so there is a little bit of an accelerator in terms of the way we're able to see our community grow and the interest in this space, because so much of this technical stuff is changing quickly. Engineers need a trusted source they can come to where they can find, and get surfaced, the best and most interesting research and open source tools. So we've capitalized on that, and we try to be a bit of an all-in-one: in one sense we're a media company, on the other hand we're an events business, on the other hand we're a community. But we're putting all these things together in a way that we think benefits careers for engineers: it enables them to level up in their careers, makes them smarter, helps them get better jobs, and lets them meet awesome people. So really, all in all, we see ourselves as building this awesome talent community around data and AI worldwide, and we think we're in a super unique position to do that and succeed at it.
Tobias Macey
0:18:33
Community building can be fairly challenging because you have so many disparate experiences and opinions coming together. Sometimes it can work out great, and sometimes you can have issues that come up just due to natural tensions between people interacting in a similar space. I'm wondering what you have been drawing on for inspiration and reference in how to foster and grow a healthy community, and any interesting or challenging or unexpected aspects of that overall experience of community management that you've encountered in the process?
Pete Soderling
0:19:13
Yeah, I think it's an awesome question. Because any company that embraces community, to some degree embraces perhaps somewhat of a wild wild west. And I think some companies and brands manage that very heavily top down, and they want to, and they have the resources to, and they're able to some others, I think, let the community sprawl. And, you know, in our particular case, because we're a tiny startup, I used to say that we're three people into PayPal accounts, I'm running events all over the world, you know, even though we're just a touch bigger than that now, not much. But we have 18 meetups all over the world and forming conferences from San Francisco to Singapore. So I think the only way that we've been able to do that, and just to be clear, like we're up for profit business, but I think that's one of the other things that makes us super unique is that, yes, we're for profit. But at the same time, we're embracing a highly principled notion of community. And we use lots and lots of volunteers in order to help you know further that message worldwide, because we can't afford to hire community managers in every single city that we want to operate in. So so that that's one thing. And I guess, for us, we've just had to embrace kind of the the wild nature of what it means to scale a community worldwide and deal with the ups and downs and challenges that come with that. And, of course, there's some brand risk. And there's, you know, other sorts of frustrations, sometimes working with volunteers, but I guess my inspiration, you know, specifically in this was really through through 500 startups, and I went on geeks on a plane back in 2012, I believe. And when I saw the way that 500 startups, which is a startup accelerator in San Francisco, was building community, all around the world, basically one plane at a time. And I saw how kind of wild and crazy that community was, I sort of learned, like the opportunity and the challenge of building community that way. And I think the main thing, you know, if you can embrace the chaos, and if your business situation forces you to embrace the chaos order to scale, I think the main way that you keep that in line is you just have a few really big core values that you talk about, and you emphasize a lot, because basically, the community has to sort of manage itself against those values. And you know, this, this isn't like a detailed, like, heavy takedown process, because you just can't in that scenario. So I think the most important thing is that the community understands the ethos of what you stand for. And that's why with data Council, you know, there's a couple things I already mentioned open source, that's super important to us. And we're always looking for new ways to lift up open source, and to encourage engineers to submit their open source projects for us to promote them. we prioritize open source talks at our conference. You know, that's just one one thing. I think the other thing for data Council is that we've committed to not be an over sponsored brand. This can make it hard economically for us to be able to grow and, and build the to hire the team that we want to sometimes, but we're very careful about the way we integrate partners and sponsors into our events. And we don't want to be, you know, what we see as some of our competitors being sort of over saturated and swarming with salespeople. 
So there's a few like, Hi thing, I guess the other thing that that's super important for us is we're just deeply, deeply committed to deeply technical content. We screen all of our talks, and we're uncompromising in the in the speakers that we put on stage. And I think all of these things resonate with engineers like I'm, I'm an engineer. And so I know engineers think and I think these three things have differentiated us from a lot of the other conferences and, and events out there, we realized that there was space for this differentiation. And I think all these things resonate with engineers. And now it makes engineers and data scientists around the world want to raise their hands and help us launch meetups, we were in Singapore. Last month, we launched our first data data council conference there, which was amazing. And the whole community came between Australia and India and the whole region, Southeast Asia, they were there. And we left with three or four new meetups in different cities, because people came to the conference saw what we stood for, saw, they were sitting next to awesome people and awesome engineers. And it wasn't just a bunch of suits at a data conference. And they wanted to go home and take a slice of data console back to their city. And so we support them in creating meetups, and we connect them to our community, and we help them find speakers. And it's just been amazing to really see that thrive. And I think like I said, the main the main thing is just knowing the the core ethos of what you stand for. And even in the crazy times, just being consistent about the way you can you communicate that to the community, letting the community run with it and see what happens. And sometimes it's it's messy, and sometimes it's awesome. And but you know, it's a it's an awesome experiment. And I just think it's incredible that a small company like us can have global reach. And it's only because of the awesome volunteers, community organizers, meetup organizers, track host for our conference that we've been able to suck into this into this orbit. And we just want to make the world a better place for them. And they've been super, super helpful and, and kind and supporting us, and we couldn't have done it without them. So it's been an awesome experiment. And, you know, we're continuing to push forward with that model.
Tobias Macey
0:24:33
With so many different participants and the global nature of the community that you're engaging with, there's a lot of opportunity for different surprises to happen. And I'm wondering what have been some of the most interesting or exciting to paraphrase Bob Ross happy accidents that you have encountered in your overall engagement with the community? Hmm,
Pete Soderling
0:24:57
I guess this wasn't totally surprising, but I just love to surround myself with geeks; geeks have always been my people. Even when I stopped writing code actively, I just gravitated towards software engineers, which is why I do what I do, and it's what makes me tick. One of the interesting things about running a conference like this is that you get to meet such awesome startups, and there are so many incredible startups being started outside of the Valley. I lived in New York City for many years, and I lived in San Francisco for many years, and now I spend most of my time in Wyoming. So I'm relatively off the map in one way of thinking, but in the other way, as the center of this conference, we just meet so many awesome engineers and startups all over the globe. And I'm really happy to see such awesome ideas start to spring up from other startup ecosystems. I don't believe that all the engineering talent should be focused in Silicon Valley, even though it's easy to go there, learn a ton, benefit from the community, and benefit from the big companies with scale. But ultimately, not everyone is going to live in the Bay Area, and I hope they don't, because it's already getting a little messy there. I just want to see all of these things get democratized and distributed, both in terms of software engineering and the engineers who start these awesome startups. So the ease with which I'm able to meet and see new startups around the globe through the Data Council community has been a real bright spot. I don't know if it was necessarily a surprise, but maybe it's been a surprise to me how quickly it's happened.
Tobias Macey
0:26:34
So one of the other components to managing communities is the idea of sustainability and the overall approach to governance and management. And I'm wondering both for the larger community aspect, as well as for the conferences and meetup events, how you approach sustainability to ensure that there is some longevity and continuity in the overall growth and interaction with the community?
Pete Soderling
0:27:02
Yeah, I think I think the main thing, you know, this gets back to another core tenet of sort of the psychographic of a software engineer, software engineers need to know how things work. And that's sort of the core of our mentality in building things. We want to know how things work, if we didn't build it ourselves. We prefer to like, rip off the covers and understand how it works. And you know, to be honest, part of the way that for instance, we select talks at our conference, you know, I think this applies to and we're learning to get better about. I mean, I think as a as a value, we believe in openness and transparency. In our company, I think externally facing, we're getting better about how we actually enable that with the community. But for instance, for our next data council conference that's coming up in New York, and in November, we've published all of our track summaries on GitHub, and we've opened that up to the community where they can contribute ideas, questions, maybe even speakers, theme sub themes, etc. And I think just the nature that, you know, we have the culture to start to plan, our events like this in the open, I think, brings a lot more transparency. And then I guess the other thing about a community that's just sort of inherent, I think, in a well run community, is the amount of diversity you get. And obviously, you know, we're all aware of that, that software engineering as a discipline, is just suffering from a shortage of diversity in certain areas. And I think as we commit to that, locally, regionally, globally, there's so many types of different diversity that we get through the event. So I think both of these things are, you know, are super meaningful in like keeping the momentum of that community moving forward, because we want to continue to grow. And we want to continue to grow by welcoming folks that maybe necessarily didn't necessarily previously identify with the data engineering space, you know, into the community so that they can see what it's like and evaluative if they want to take a run that in their career. So I think all these things, transparency, openness, diversity, these are all Hallmark hallmarks of a great community and, and these are the engines that keep the community going and moving forward. Sometimes in spite of the resources or the lack of resources, you know, that a company like data council itself, can muster at any one time.
Tobias Macey
0:29:22
In terms of the conference itself, the tagline that you focused on, and we've talked a little bit about this already, is that they are no fluff, paraphrasing your specific wording, and as a way of juxtaposing them against some of the larger events that are a little bit more business oriented, not calling out any specific names. And I'm wondering what your guidelines are for fulfilling that promise, and why you think it is an important distinction? And conversely, what some of the benefits are for those larger organizations, and how the two scales of event play off each other to help foster the overall community?
Pete Soderling
0:30:02
Yeah, well, one, one thing here is, I think, comes to the mentality of the engineer. And then the other side of it is the mentality of the sponsor and the partner. And, you know, hey, I think engineers are just Noble. And like I said, engineers want transparency, they want to know how things work. They don't want to be oversold to, you know, they want to try product for the self. There's just all of these sort of things baked into the engineering mindset. And first and foremost, we want to be known as the conference in the community that respects that, like, that's the main thing, because engineers like without engineers, and our community, loving and getting to know each other, we're not careful about the opportunities in the context that we create for them, they're just going to run in the other direction. And so like, first and foremost, like those are the hallmarks of of what we're building from the engineering side. Then on the partnership side, I think companies are not great at understanding how engineers think recruiters are not great at talking to engineers, marketers are not great at talking to engineers. Yes, engineers need jobs. And yes, engineers need new products and tools, but to find companies that actually know how to respect the mental hurdles that engineers have to get through, in order to like get interested in your product or get interested in your job. You know, that's a super significant thing. And through my years of working in the space, I've done a lot of coaching and consulting with CTOs, specifically surrounding these two things, recruiting and, and marketing to engineers. And I think that awesome companies who respect the the central place that engineers have, and will continue to have in the innovation economy that's coming, realize that they have to change their tune in the way they approach these engineers. So I, you know, our conference platform is a mechanism that we can use to gently sort of steer and even teach some partners how to interact with engineers in a way that doesn't scare them away. And so just broadly speaking, like I mentioned, we're just super careful about how we integrate partners with our event. And we're always as a team trying to come up with, with better ways to message this and, and better ways to educate and, and sort of welcome sponsors, you know, into the special, the special network that we've built, but it's challenging, you know, like not not all marketers think alike. And not all marketers know how to talk to engineers, but we're committed to creating a new opportunity for them to engage with awesome data scientists and data engineers in a way that's valuable for both of them. And that's a really fun, big challenge. And, you know, we're not as worried about how much as it scales right now, as much as we were the quality, enhancing the quality of those interactions. And so that's what we're committed to as a brand. And, you know, it's not always easy. But we've we learned a lot, and we have a lot to learn. And we always sort of touch touch base with the community after the events and sort of asked the community what they thought and how they interact with the partners, then did they find a new job? And how did that happen? And so we're always trying to pour gasoline on what works, not respecting continue to innovate and move forward in that way,
Tobias Macey
0:33:03
in addition to your work on building these events, and growing the meetups, and overall network opportunities for people in the engineering space, you have also become an involved as an investor. And you've mentioned that you focus on data oriented businesses. And I'm curious how you ended up getting involved in that overall endeavor, and how that fits into your work with data Council and what your overall personal mission is. Oh, yeah.
Pete Soderling
0:33:30
Well, that's definitely one of my side projects. As I mentioned, I want to help 1000 engineers start companies, and this is just part of what makes me tick: helping software engineers through the conference, through advising their startups, through investing, and through helping them figure out go to market. I guess a lot of this energy for me came from having started four companies myself, as an engineer who didn't go to school but instead opted to start my first company in New York City in 2003. There weren't a lot of people who had started three companies in New York by the time the early rounds of layoffs came around. So yeah, I guess I've learned a lot of things the hard way, and I think a lot of engineers are kind of self taught, so they also tend to learn things the hard way. So a lot of my passion there is, again, meeting engineers where they're at and how they learn. To them, I'm kind of a business guy now. I have experience with starting companies, building products, fundraising, marketing, and building sales teams, and most of those things have not necessarily been top of mind for many software engineers who want to start a company. They have a great idea, they're totally capable of engineering it and building a product, but they need help with all the other software business-y stuff, as well as fundraising. So I guess I've discovered the sort of special place I have in the world where I'm able to help them and coach them through a lot of that business side. I could never build the infrastructure that they're building or figure out the algorithms or the models that they're building, but I can help them figure out how to pitch it to investors, how to pitch it to customers, how to go to market, and how to hire a team that scales. Through my ups and downs as an entrepreneur I've developed a large set of early stage hustle experience, and I'm just super happy to pass that on to other engineers who are also passionate about starting companies. That's just something I find myself doing anyway, as a mentor for 500 Startups or as a mentor for other companies. One thing led to another and soon I started to do angel investing, and now I have an AngelList syndicate, which is actually quite interesting, because it's backed by a bunch of CTOs and heads of data science and engineers from our community who all co-invest with me. As I'm able to help companies bring their products to market and startups come to market, oftentimes there will be an investment opportunity there, and then there'll be another network of technical people who add even more value to that company. So I'm just sort of a connector in this community, and the community is doing all kinds of awesome stuff inside and even sometimes outside of Data Council, which is just a testament to the power of community overall. I'm super grateful that I'm along for the ride, and I've got engineers who come to me and trust me for help, and I'm able to connect these dots and help them succeed as well.
Tobias Macey
0:36:30
in terms of the ways that businesses are structured. I'm wondering what it is about companies that are founded by engineers and led by engineers that makes them stand out, and why you think that it's important that engineers start their own companies, and how that compares to businesses that are founded and run by people who are coming more from the business side. And just your overall experience in terms of the types of focus and levels of success that two different sort of founding patterns end up playing out?
Pete Soderling
0:37:03
Yeah, well, yeah. I mean, you can tell based on what I've been saying that I'm just super bullish on the software engineer. And, you know, does that mean that the software engineer as a persona or a mentality or a skill set, you know, is inherently awesome and has no weaknesses? And no problems? Like hell? No, of course not. And I think the some of the challenges of being a software engineer and how your mentality fits into the rest of the business are well documented. So I think all of us as engineers need to grow and diversify and increase the breadth of our skills. And so that has to happen. But on the other hand, if we believe that innovation is going to continue to sweep the world, and highly technical solutions, perhaps to sometimes non technical problems, perhaps sometimes the technical problems are going to continue to emerge. I feel like people who have the understanding of the the technical implications and the realities and the architectures and the science of those solutions, just have an interesting edge. So I think there's a lot of hope in teaching software engineers how to be great CEOs. And I think that's, that's increasingly happening. I mean, look at Jeff Lawson from Twilio. Or the guys from stripe, even Uber was started by an engineer, right? There was the the the quiet engineer at goober at Uber, Garrett, sort of quiet in terms of Travis, you know, who, who was a co founder of that company. So I think we're seeing engineers, not just start the most amazing companies that are that are changing the world, but they're increasingly in positions of becoming CEOs. And those companies, you know, I guess you might even take that one step further. And I'm kind of trying to be an engineer who's also been an operator and a founder. But now I'm, I'm stepping up to becoming a VC and, and being an investor. So I think there's the engineer VC, which is really interesting, as well. But I think that's a slightly different conversation. But but suffice it to say that engineers are bringing a valid mentality into all of these disciplines. And yes, of course, an engineer has to be taught to think like a CEO, and has to learn how to be strategic and has to learn sales and marketing skills. But I think it's just an awesome, awesome challenge to be able to layer those skills on top of engineering discipline that they already have. And I'm not saying this is the only way to start a company or that business people, you know, can't find awesome engineers to partner with them. I mean, honestly, I think an engineer often needs a business co founder, to help get things going. But I I'm coming at it from the engineering side, and then figuring out like, all the other support that the engineer needs to make a successful company, and that's just because I've chosen that particular way, but other people will be coming at it from the business side, and, and I'm sure that will be fine for them as well,
Tobias Macey
0:39:47
in terms of the challenges that are often faced for an engineering founder in growing a business and making it viable. What are some of the common points of conflict or misunderstandings or challenges that they encounter? And what are some of the ways that you typically work to help them in establishing the business and setting out on a path to success? Well, I
Pete Soderling
0:40:10
think the the biggest thing, you know, that I see is, many engineers are just afraid to sell. And unfortunately, you know, you can't have a business if you can't have some sales. And so somehow, engineers have to get over that hurdle. And that can be a long process. It's been a long process for me. And I still undersell what we do at data council to be honest, in some ways, and I have people around me to help me do that. And we want to do that, again, in a way that's in line with the community. But I'm constantly trying to figure out how to be essentially a better salesperson. But for me, that means that still retaining sticking to the core of my engineering values, which is honesty, transparency, enthusiasm, you know, value and really understanding how to articulate the value of what you're bringing in a way that's, that's unabashedly honest and transparent. So I think that's a, that's a really big thing for a pure engineer founder is, it can be difficult to go out there and figure out how to get your first customers, you know, how to start to sell yourself personally. And then the next step is how do you build a sales culture and a process and a team that's in line with your values as a company, and that scares, you know, that scares some engineer, because it's just terrifying to think about building a sales org, when you can barely accept that your product needs to be sold yourself. But I think that's just a you know, that's sort of ground zero for starting a company. And so, you know, I try and be as gentle as possible and, and sort of guiding engineers through that process. But I guess that's the one of the core hiccups that I think engineers have to figure out how to get over by bringing in other people they trust or getting advice, or, you know, you have to approach it sometimes with kid gloves, but teaching engineers how to sell in a way that's true to their values. I think it's just a really big, big elephant in this room that, you know, I constantly run into and, and try to help engineer founders with
Tobias Macey
0:42:08
in terms of the businesses that you evaluate for investing in what is your strategy for finding the opportunities? And what are the characteristics of a business that you look for when deciding whether or not to invest? And then as a corollary to all of that? Why is it that you're focusing primarily on businesses that are focused mainly in the data space, and the types of services that you're looking for, that you think will be successful and useful?
Pete Soderling
0:42:37
Yeah. Well, I guess last question. First, I think, you know, it's important to have focus as an investor, and not everybody can do everything awesome. I think it's also a strategy to building a new brand, and a niche fund in the face of the sequoias and the corners of the world. I think we might have like we might be past the day is when a new fund can easily build a brand. That's, that's that expansive. So I think this is just kind of, you know, typical marketing strategy. I think if you start to focus on a niche and do one niche really well, I think that produces network effects, smaller network effects inside that niche that then grow and grow and grow and develop. So I mean, I've chosen to focus in my professional life, on data, data, data science, data engineering, data analytics, that's data console. Partly that's because I just believe in the upside of that market. So I think that's just a natural analogue to a lot of my investing activity, because I'm knowledgeable about the space because I have a huge network in the space, because I'm looking at interested in talking to these companies all the time. Um, it's just a natural fit. That's not to say that I don't I mean, I'm also passionate about broader developer tools. And as you mentioned earlier, I'm an investor in clubhouse. I'm interested in security companies. So I think, you know, for me, there are some analogues to just like, deeply technical companies, you know, look by technical founders, that that also fit my thesis. But, you know, still it's a fairly niche, narrow thesis, like most of the stuff I do. On the investing side is b2b, I meet the companies through my network and, and through data Council, I think they're solving you know, meaningful problems in the b2b space. And other criteria often is, they may be supported by some previous open source success, or the company might be built on some current open source project that I feel gives them an unfair advantage when it comes to distribution, and getting a community excited about the product. So these are a few of the things that I look for, in terms of the investing thesis
Tobias Macey
0:44:39
in terms of the overall industry and some of the trends that you're observing and your interaction with engineers and with vetting the different conference presentations and meetup topics, what are some of the trends that you're most excited by? And what are some of the ones that you are either concerned by or some potential issues that you see coming down the road in terms of the overall direction that we are heading as far as challenges that might be lurking around the corner?
Pete Soderling
0:45:10
Well, one big thing there is just data science and ethics: AI fairness, bias in AI and in deep learning models, and ethics when it comes to targeting customers and what data you keep on people. I think all these things are super interesting. They're business issues, they're policy issues, and at one level they're also technical issues, so there's technical implementation work that's required. I just think raising that discussion is important, and so one area that we're focusing on at Data Council, in the next series of events that we run later this year and next year, is elevating that content in front of our community so that it can be a matter of discussion. Those topics aren't always seen as the most technical, but I think they're super important in terms of trying to help the community steer and figure out where the ship is going in the future. So I think that's super interesting. Then I guess on the technical side, I think the DataOps world is starting to mature. There are a lot of interesting opportunities there, in the same way that the DevOps revolution revolutionized the way that software was built, tested, deployed, and monitored, and companies like Chef and New Relic came out in perhaps the mid 2000s. I think we're at a similar inflection point with DataOps, if you will. There's more repeatable pipeline process, there are backtesting and auditing capabilities that are required for data pipelines that aren't always there, and there's monitoring infrastructure being built by some cool companies that I've seen recently. So DataOps, and basically just elevating data engineering to the level that software engineering has been at for a while, is definitely something that seems to be catching fire, and we also try to support that at the conference as well.
Tobias Macey
0:47:06
Looking forward, what are some of the goals and plans that you have for the future of the data council business and the overall community and events?
Pete Soderling
0:47:17
Well, I think, as I mentioned, our biggest goal is to build the data and AI talent community worldwide. And I think that there's, we're building a network business. So I guess it kind of takes me back to when I started, hacker labs, which was the digital social network, and I thought we were building a digital product. And as I already mentioned, one thing led to another and now we have data console instead. Well, data console is, you know, butts and seats at events and community. And it's IRL, engineers getting together. But it's still a network. It's not a super fast digital, formal network. But it's a network. And it's a super meaningful network. So it's kind of interesting that after all, the ups and downs, like I still see ourselves as in the network building business. And I think the cool thing about building a network is once you build a network, there's lots of different value that you can add on it. So I think in terms of there's, there's really big ideas here, there's formalizing education and education offerings, there's consulting services that can be built on top of this network, where companies can help each other, or recruiting sort of fits in the same vein, I think there's there's other things as well, there's a fund, which I have mentioned is a side project that I have to help engineers in this community start companies. So I think there's, there's all kinds of interesting things that you can build on top of a network, once you've gotten there. And for now, you know, our events are essentially a breakeven business that just gives us an excuse, and the ability to grow this network, all around the world globally. But I think, you know, there's a much bigger, like phase two, or phase three of this company where we can build really awesome stuff based in this engineering network, and network and a brand that engineers trust that we've laid down in the in the early part of the building phase. So I'm really excited to see that and, and develop that strategy and mission going forward.
Tobias Macey
0:49:14
Are there any other aspects of your work on data council or your investment, or your overall efforts in the data space that we didn't discuss yet that you'd like to cover before we close out the show?
Pete Soderling
0:49:26
No, I think we covered a lot of stuff. I hope it was interesting for you and your audience, and I encourage folks to reach out to me and get in touch. If there are engineers out there who want to start companies, or engineers who want to participate in our community worldwide, we're always looking for awesome people to help us find and screen talks. We're interested in awesome speakers as well. I'm always interested in talking to deep learning and AI researchers who are out there who might have ideas that they want to bring to market. But yeah, you can reach me at pete@datacouncil.ai, and I'm happy to plug you into our community. And if I can be helpful to anyone out there, I would just really encourage them to reach out.
Tobias Macey
0:50:09
And for anyone who does want to follow up with you, or keep in touch, or follow along with the work that you're doing, I'll have your contact information in the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Pete Soderling
0:50:26
Yeah, I think, as I mentioned, maybe it comes down to this data ops thing, right? There are really interesting open source projects coming out, like Great Expectations, and interesting companies coming out like Elementl, which is built around Dagster, an open source project. So I think this is a really interesting niche of tooling, specifically in the data engineering world, that I think we should all be watching. And then I guess the other category of tooling I'm seeing is sort of related; it's in the monitoring space. It's watching the data in your data warehouse to see if there are anomalies that pop up, because we're pulling together data from so many hundreds of sources now that I think it's a little bit tricky to watch for data quality and integrity. So there's a new suite of tools popping up in that data monitoring space which are very interesting. Those are a couple of areas that I'm interested in and looking at, especially when it comes to data engineering applications.
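As a rough illustration of the warehouse-monitoring category Pete mentions, here is a minimal sketch of a row-count anomaly check, assuming a Python environment with a SQLAlchemy connection to the warehouse; the table, column, and threshold names are hypothetical and not tied to any particular vendor's product.

    # Minimal sketch of row-count anomaly monitoring for one warehouse table.
    # Assumes a SQLAlchemy engine pointed at the warehouse; all names are hypothetical.
    from datetime import date, timedelta
    import statistics
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://analyst:secret@warehouse.example.com/analytics")

    def daily_row_counts(table, days=14):
        """Return row counts per load day for the trailing window, oldest first."""
        query = text(
            f"SELECT loaded_date, COUNT(*) AS n FROM {table} "
            f"WHERE loaded_date >= :start GROUP BY loaded_date ORDER BY loaded_date"
        )
        start = date.today() - timedelta(days=days)
        with engine.connect() as conn:
            return [row.n for row in conn.execute(query, {"start": start})]

    def looks_anomalous(counts, tolerance=3.0):
        """Flag the latest day if it sits more than `tolerance` standard deviations
        away from the trailing mean."""
        *history, latest = counts
        mean = statistics.mean(history)
        spread = statistics.stdev(history) or 1.0
        return abs(latest - mean) > tolerance * spread

    if looks_anomalous(daily_row_counts("analytics.orders")):
        print("Row count for analytics.orders looks anomalous; check the pipeline.")

The real products in this space do far more than a standard-deviation check, but the sketch shows the basic shape: watch the data itself, not just whether the jobs ran.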
Tobias Macey
0:51:26
Well, thank you very much for taking the time today to share your experiences building and growing this series of events, and contributing to the overall data community, as well as your efforts on the investment and business side. It's definitely an area that I find valuable, and I've been keeping an eye on your conferences; there have been a lot of great talks that have come out of them. So I appreciate all of your work on that front, and I hope you enjoy the rest of your day.
Pete Soderling
0:51:50
Yeah, thanks, Tobias. We'll see you at the Data Council conference sometime soon.
Tobias Macey
0:51:59
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building Tools And Platforms For Data Analytics - Episode 95

Summary

Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has its own set of requirements for the way that they access and interact with those platforms, depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics, and in this episode he explains the set of considerations and requirements that data analysts need from their tools. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing some of the main features that you are looking for in the tools that you use?
  • What are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack?
  • What should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?
    • In terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?
  • In terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)?
  • What are some anti-patterns that data engineers can guard against as they are designing their pipelines?
  • In your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with?
  • How much understanding of analytics is necessary for data engineers to be successful in their projects and careers?
    • Conversely, how much understanding of data management should analysts have?
  • What are the industry trends that you are most excited by as an analyst?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, don't miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts. And just full disclosure, Mode is a past sponsor of this show. So Benn, could you start by introducing yourself?
Benn Stancil
0:01:52
I am one of the founders and chief analyst of Mode. Mode builds a product for data analysts and data scientists, so I'm responsible for both our internal analytics here at Mode, as well as working a lot with our customers to help them better understand, or help us better understand, the needs that they have in the product and how we can better serve them as analysts and data scientists. Prior to Mode, I worked on the analytics team at Yammer, which was a startup that was purchased by Microsoft in 2012. And before that, my background is in economics and math, so I actually worked for a think tank in DC for a few years doing economic research before landing in San Francisco and
Tobias Macey
0:02:29
in the tech world. And do you remember how you first got involved in the area of data management?
Benn Stancil
0:02:33
Yeah, so it was actually as a customer, really. I was working as an analyst at Yammer, my first job in tech, and I was really a customer of our data engineering team: we used the tools that they built as well as the data that they provided. Yammer was kind of one of the early leaders in the philosophy that engineers shouldn't build ETL pipelines, which is now something that's become a little bit more popular. There's a blog post from the folks over at Stitch Fix that talks about this very explicitly, but Yammer had the same philosophy. And so while I was there, we were responsible for building our own pipelines, and for sort of dipping our toes into the data engineering and data management world. That was kind of my first taste of it. Then after leaving Yammer and starting Mode, which as I've mentioned is a product for data analysts and data scientists, I actually ended up taking two further steps into data management. First, I'm responsible for our own data infrastructure here at Mode, so my role is to think about not just the analytics that we do, but how we actually get the data into the place we want it. But also, in a lot of ways, Mode is providing the same service that the Yammer data engineering team was providing me when I was an analyst: we are now building tools for other analysts and data scientists. And so for the product that we provide, we very much have to think about how it fits into the data management ecosystem, and how it solves the problems that not just analysts and data scientists have, but the problems that data engineers have when they're trying to serve those customers.
Tobias Macey
0:03:58
And so you've mentioned that in your work at Mode you're actually responsible for the data management piece, and that you're working closely with other data engineering teams and other analysts to make sure that they are successful in their projects. I'm wondering if you can start by describing some of the main features that you are generally looking for in the tools that you use, and some of the top level concerns that you're factoring into your work at Mode and the tool that you're providing to other people.
Benn Stancil
0:04:27
Yeah, so internally at Mode, one of the things that we really care about is that we want it to be something that is easy to use for the analysts and data scientists who are actually consuming that data. So again, to come back to the point from that Stitch Fix blog post, we really believe that the data scientists here at Mode should be responsible for as much data management as possible, and that there are a lot of great tools out there now, ETL tools or warehouse tools or pipeline tools, that analysts can manage pretty well. You don't need someone to be a dedicated capital-E engineer to build out the initial phases of the pipeline. And so for us, when we evaluate those tools internally, we want to make sure they're things we can set up pretty easily, and that as customers of those tools who aren't necessarily fully fledged engineers ourselves, we still know how they work and can still make sure they're up and running and performing the way we want. The analogy we often use is that it's like buying a car: you don't necessarily need to know the ins and outs of how the car works, but you need to know that it's reliable. And if you learn to not trust the car, that it's not actually going to drive when you need it to, you don't want to learn how to fix the car, you just want to buy a different car that actually works. And so when we're looking for tools ourselves, we tend to focus a lot on what the experience is like for the folks who are using it. Can we rely on it? Is it something that we need to have a dedicated person to run? Or is it something that we can kind of run in the background so that the analysts,
0:06:01
the data scientists, can get it to work the way they like to work.
0:06:05
The other thing that we really look for is usability. I think this is a place where, with ETL tools and data pipeline tools, the folks who are building them often don't think about it as much as perhaps they could, which is that the surface area of those tools isn't the application itself or the web interface. I really think of the surface area of those tools as the data itself: if I'm using an ETL tool, the way that I interact with that tool day in and day out is by actually interacting with the data the tool is providing, not by logging into the web interface and checking the status of the pipelines and things like that. And so in those cases little things matter. It ends up being column names that matter: are there weird capitalization schemes, or are there periods in column names? Those little things that make it more frustrating to work with day in and day out end up being the things that really drive our experience with those tools. Working with customers, most of our customers range from small startups to much larger enterprises. The small startups often look like us. For the large enterprises, the place that we really try to focus is making sure that the tools we recommend are modular. Data stacks end up becoming very complicated; they end up having to serve a lot of different folks across a lot of different departments, pulling data from tons of different sources. We try to steer people away from one tool to rule them all. Having one pipeline, one warehouse, one analytics tool, all of these things serving every need, sounds nice, but it's often very difficult to actually create. We'd rather people be able to modularize different parts of their stack so that if something new comes along that they want to use, they can easily swap something in and out without having to re-architect the entire pipeline.
Tobias Macey
0:07:53
A couple of things jumped out at me as you were talking. One is that you're talking a bit about some of the hosted, managed platform tools, where anybody can click a button and have something running in a few minutes. And then on the other side of the equation, particularly if you have a very engineering-heavy organization or a larger organization, you're probably going to have dedicated data engineers building things, either by cobbling together different open source components or by building something from scratch. And I'm curious what you have found to be the juxtaposition as far as the level of polish and consideration for the user experience of the analyst at the other side of the pipeline, as you have worked with different teams and different tools that fall on either side of that dividing line of the managed service versus the build-your-own platform.
Benn Stancil
0:08:46
So for the managed services, it depends on the tool. Some tools do a really great job of this and some tools less so, and that's probably true for any products in a similar space: some products do a really good job thinking about the experience for customers, and others are more focused on technical reliability or on other aspects of the product experience. For an example of one place where there's a great tool that does a lot of things really well but also has one of these pain points: Snowflake, for instance. We're actually Snowflake customers. The database is a very powerful tool for us, and we recommend it to just about anybody. But, I believe in the tradition of the Oracle folks from which it came, all of their column names are automatically capitalized. And it's just one of those small irritations where it seems like, when they developed it, it wasn't necessarily a consideration how analysts are going to be interacting with it day in and day out, when all of your queries are constantly yelling at you because you have to
0:09:45
capitalize everything. So
0:09:46
little things like that, I think, are places where companies could think a little bit more about the ways that people use it. From the perspective of internal tools, and the folks who are building these from scratch, I actually think that in a lot of cases those tools tend to be better in the ways that they think about user experience, because the people who are building them are sitting directly next to the customers they're building them for. If you're an engineer and your customer is the desk over from you, there's a really fast back and forth on how you actually use this; you see somebody use it every day, you hear them complain about it, they
0:10:19
get the benefit that vendors don't get of
0:10:22
literally working with their customer day in and day out, and their customer being able to walk over to their desk and say, hey, this is a thing, can you change it, and stuff like that. So while the internal tools often aren't as technically robust, often aren't as reliable in a lot of other respects, and aren't nearly as powerful and flexible, the small things often work a little bit better because they were custom built for exactly that audience.
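To make the Snowflake column-casing irritation Benn describes concrete: unquoted identifiers get folded to upper case, so an analyst either types upper-cased names everywhere or quotes identifiers to preserve case. Here is a minimal sketch using the standard snowflake-connector-python client; the account, table, and column names are hypothetical.

    # Sketch of Snowflake's identifier-casing behavior; connection details are hypothetical.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="analyst", password="...", account="example_account", database="ANALYTICS"
    )
    cur = conn.cursor()

    # Unquoted identifiers are folded to upper case, so this lands as SIGNUPS
    # with columns USER_ID and SIGNED_UP_AT.
    cur.execute("CREATE TEMPORARY TABLE signups (user_id INT, signed_up_at TIMESTAMP)")

    # Queries still resolve case-insensitively...
    cur.execute("SELECT user_id FROM signups")
    # ...but metadata and results come back upper-cased, which is the part that
    # ends up "yelling" at analysts downstream.
    print([col[0] for col in cur.description])  # ['USER_ID']

    # Quoting identifiers at creation time preserves lower case, at the cost of
    # having to quote them in every query that touches the table.
    cur.execute('CREATE TEMPORARY TABLE "signups_lc" ("user_id" INT)')
    cur.execute('SELECT "user_id" FROM "signups_lc"')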
Tobias Macey
0:10:42
For the things that you're talking about that contribute to an overall better experience for the analyst, things like managing the capitalization of column names or preventing the insertion of dots in a column name that will break a SQL query, what are some of the guardrails that data engineers should be thinking about in terms of how their tools are able to generate specific types of output, or the overall ways that they can be thinking of the end goal as they're building these different tools and pipelines, that would make it easier for analysts and other people within the business to actually make effective use of the end result of their work?
Benn Stancil
0:11:23
Yeah, I think it's being a product manager in a
0:11:25
lot of ways; it's doing the research and knowing your customer.
0:11:29
Those little things
0:11:30
aren't things that are necessarily going to be obvious, and it's very difficult to build a framework to show you exactly how those things will work. I think the best frameworks are the ones that product managers or designers build: how do you understand the needs of your customer? How do you engage with your customers and learn from them?
0:11:48
You know, even as an analyst
0:11:50
and as someone who lives in these tools day in and day out, and is the customer of them in a lot of respects, I don't know that I could sit down and make a list of all the little things that I like or don't like. It's something that you very much realize as you're working with it, in the same way that, I'm sure, for engineering products or tools that are built for engineers, people have opinions about how those should be built but don't necessarily have the ability to just write down, here is the framework for understanding all these things. A lot of it is wanting to build something that your customers like, and then taking the time to listen to them and understand what the pain points are that they're having and where they come from. Like, what are you trying to accomplish when you do this? I think a lot of it is the fundamental aspects of product management and
0:12:32
user research to really get at the core of what those problems are,
Tobias Macey
0:12:35
From the perspective of team dynamics, or an organizational approach to managing these data projects, how much responsibility do you feel lies with the data engineering team to ensure that they're considering these end user requirements, as far as the capitalization of schema names or things like that, when they're also concerned with trying to make sure that they're able to obtain source data, that their pipelines are reliable and efficient, and they have, you know, maybe N plus one different considerations for the overall management of the data itself before it gets delivered? And how much of it is a matter of incorporating analysts in that overall workflow? Basically, I'm trying to get the breakdown of where you see the level of understanding and responsibility for identifying and incorporating these UX points in the overall lifecycle of building these products.
Benn Stancil
0:13:32
So generally, I would say the tool needs to work, and those other aspects matter, and I'd recognize that for a lot of the data engineers building these products, either internally or as vendors, there are lots of very complicated problems they have to work on, and I think most analysts would recognize that as well. I don't think the responsibility necessarily lies with the engineers building these tools to go out and determine all this stuff on their own, without getting the help of their customers to tell them.
0:14:00
I think the thing that is the responsibility of the tool builders is more just having the empathy of
0:14:07
the customer that's using it. It's less about, you know, I need
0:14:09
to go figure out what all these little things are, the usability issues or other things like that, that are going to get in the way of my customer using this every day. It's that they should be willing to listen when somebody has that feedback, and recognize that those sorts of things will also affect how somebody uses the tool that they build. So, again, as with any product, I don't believe you can build something that's a technical marvel if it's not something that people want to use. There are plenty of examples of tools and companies that have done this, that focused on, if I build this monument to some technical expertise, people will come use it. Well, not really. People will use it because it helps them solve a problem. And while you need to figure out a way to balance those two things, I do think there's some empathy for the customer that's necessary there, of what is it that makes you want to use this thing every day. All of the upstream technology that it requires is very important, obviously; if there's super organized and super clean and super well defined data, but there's no data in there because nobody was actually able to get it from the source into the warehouse or whatever, obviously nobody's going to use that either. But it's important to keep in mind that usability matters, and you have to have empathy for your customers as you're building this product.
Tobias Macey
0:15:28
And ultimately, from the few things that we've brought up as examples, they're fairly small changes that don't really include any additional technical burden on the people building the pipelines, it's just as you said, a matter of empathy. But from your experience of working with other data engineers, and with your customers and different data teams, what are some of the common points of contention that you have seen arise between data engineers and data analysts as far as how they view the overall workflow or lifecycle of the data that they're trying to gain value from?
Benn Stancil
0:16:00
So, yeah, the examples were admittedly simple ones. I think there are also places, and this happens more in the internal use cases, where there are more complicated things, where an analyst is frequently trying to solve a problem in a particular way. Maybe they want mapping software or something like that, because that's the way execs often ask questions, and they just want a quick way to visualize something on a map rather than having to format it, pull it out, and load it into Tableau. So there may be more complicated requests there, which I think is again kind of a product question for the data engineers to figure out: how hard is that to build, how much value is it really providing, and understanding the use cases behind the request. In terms of these points of conflict, or where people can align, I think one of the things that's really important is for data engineers to understand how data scientists and analysts think. Again, it's really understanding your customer, but it's not just understanding, I need to be able to deliver dashboards to executives and answer these challenging questions; it's understanding who your customers are and where they come from. And I think there are a couple of big things that define a lot of analysts that are critical for thinking about how you build tools for them. One is that they're trying to solve problems quickly, and often trying to answer questions quickly and in kind of rough ways. They'll get a question from an exec, like, why is this marketing campaign not working? And they're not trying to answer that scientifically; they're trying to turn around something so that the business can make a decision. To an engineer, or to a statistician, or to somebody who's focused on building robust tools, the way that an analyst works may look sloppy. It may look like they're not crossing t's or dotting i's; they're very quickly trying to do something rough and hack their way to a solution. But in a lot of cases that's the whole point. That is the value an analyst provides: take complicated questions, distill them down to something pretty quick, ship it off to somebody who's making a decision, and help the business make a decision and move forward. And so in a lot of ways, the way the tools get built and the way to remove friction is to understand not just the problems they're trying to solve, but the mindset behind it, which is: all right, I have a question, how do I answer it? How do I draw conclusions from these observations? It's not, how do I build a logical model, how do I build the most mathematical thing possible, how do I abstract a complex system. It often feels kind of rough and sloppy, but to an analyst,
0:18:37
that's the job. And
Tobias Macey
0:18:39
so far we've been talking about the API between data engineers and data analysts as being the delivered data, probably sitting in a data warehouse somewhere. But on top of that, you've also built the Mode platform, and there are other tools, such as Redash or Superset, that exist for being able to run these quick analyses: write some SQL, get a response back, maybe do some visualization. I'm wondering, as far as the way that you think about things, and also the way that you've seen customers approach it, where the dividing line is in terms of the platform and tooling that data engineers should be providing, versus where the tooling for performing these rapid analyses lives, in terms of who owns it and who drives the overall vision for those projects.
Benn Stancil
0:19:27
Yeah. So I think that kind of has to be a joint effort. One of the failure modes here, I think, can be just kind of throwing it over the wall, you know, you build the tools, I'll consume the tools, rather than these two teams being tightly synced. I think it's important for them to have a similar focus on the same problems. Data engineers shouldn't just be thinking, my objective is to build a tool. I'm a believer that data engineers should sit very closely to data scientists or data analysts, and basically have the same KPIs. The objective should be, how do we answer the questions we as a business need to answer, and a data engineer's job is to enable that. Their job isn't to say, all right, I've hit my KPIs because I delivered a tool; they should be trying to serve the same needs as the data scientist. And so if you end up with the kind of, okay, we'll build a tool and somebody else will take it and consume it, I think you end up with this disconnect where the analysts aren't able to actually deliver the quality analysis they need to deliver, and there ends up being a lot of friction at that touchpoint between the two, because analysts are looking for a particular thing, they come back to the data engineers to ask for it, and the data engineers feel like they're being told what to do. You want to allow these groups to be a little bit more autonomous, and I think the only way you can do that is to allow them to be invested in the same result. So it's enabling engineers to understand, what is the value of that product, not just to the analysts, but how does it provide value to the broader set of users around the company? And so in a case like Superset: if I'm a data engineer implementing Superset at a company, I want to understand not just, okay, great, they want Superset plugged into these databases. It's, what questions are you trying to answer with that? At what frequency do you need to do it? Who's the customer of those questions you're trying to answer? All those things help drive some of the decision making behind how and where I get it up and running. Is it something that everybody needs access to? Is it something that just the analysts need access to? Does something need to be shared easily? There's a lot of work there, I think, that is in this gray area between the two, and those two groups need to be open to that gray area, rather than having it perfectly defined as, you work on just these things, I work on just these things.
Tobias Macey
0:21:39
Yeah, in some ways it's changing the definition of done, where in one world the definition of done for a data engineer is, I've loaded the data into the data warehouse, I have washed my hands of further responsibility, it's now in somebody else's court; versus the definition of done being, I was able to get the data from this source system all the way to the chief executive who needs to use the resulting analysis to make a key decision that will drive our business to either success or failure, and aligning along those business objectives rather than along responsibility objectives. That's one of the recurring themes that's been coming up a lot in the conversations I've had on this show, as well as one of the themes that arose in the division between software engineers and systems administrators that led to the overall DevOps movement: a lot more work in terms of aligning the entire company along the same objectives, rather than trying to break them down along organizational hierarchies.
Benn Stancil
0:22:36
Yeah, I agree. And, again, getting back to some small details, but one example of this actually comes to mind where this broke down. There was a data engineering team that was very much focused on loading data into the warehouse, and an analytics team that was responsible for taking that data, answering questions with it, and passing it off to somebody else. And there was a column, I believe the column was updated_at, because it's one of those system fields that a lot of Rails apps and things like that have: system-generated updated_at timestamps. That was put into the warehouse because that's what it was, and to an engineer it made perfect sense: they put it in there, it's clearly named, all of that. The analysts and data scientists interpreted it to mean something different, without understanding where it came from or where the problem was heading downstream of it. It was like a clean handoff where both sides thought, updated_at, I know exactly what that means. Both sides thought they knew exactly what it meant, and then the analysts ended up creating analyses on top of it with the assumption that the column meant something that it didn't. And just by having this, okay, you take the ball and run with it, and, as you said, I wash my hands of it, they ended up deriving a very bad result. That probably could have been fixed very easily, but because the team wasn't focused on the end result, the business objective, the data engineers never actually saw the analysis that was being produced and never really understood the questions that were being asked. They assumed, okay, you're using it the way you should be using it, instead of stepping back and saying, all right, let's work on the actual problem together to make sure we're solving it in the right way.
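One low-tech guard against exactly this confusion is to rename ambiguous source columns on the way into the warehouse and stamp an explicit load time, so a source system's bookkeeping can't be mistaken for warehouse freshness. A minimal sketch, assuming a pandas-based load step; the table and column names are hypothetical.

    # Sketch: keep source timestamps and warehouse load metadata clearly separate.
    from datetime import datetime, timezone
    import pandas as pd

    def prepare_for_warehouse(source_rows: pd.DataFrame) -> pd.DataFrame:
        df = source_rows.copy()
        # `updated_at` is Rails-style bookkeeping from the source application,
        # not an indicator of when the warehouse copy was refreshed.
        df = df.rename(columns={"updated_at": "source_updated_at"})
        # Record the load time explicitly so the two can't be conflated downstream.
        df["warehouse_loaded_at"] = datetime.now(timezone.utc)
        return df

    raw = pd.DataFrame(
        {"id": [1, 2], "updated_at": ["2019-08-01T12:00:00Z", "2019-08-02T08:30:00Z"]}
    )
    print(prepare_for_warehouse(raw).columns.tolist())
    # ['id', 'source_updated_at', 'warehouse_loaded_at']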
Tobias Macey
0:24:04
This is where another one of the current trends comes into play, of there being a lot more focus on strong metadata management and data catalogs, and being able to track the lineage of the data so that you can identify, oh, this updated_at column wasn't created for the data warehouse, it actually came from the source database, and I now understand a bit more about the overall story of how it came into play, versus having this very black box approach of, all I know is this is how it is now. And also, in terms of metadata management, being able to know how frequently this data is getting refreshed, and when it was last actually updated in the data warehouse, versus whatever this updated_at column is supposedly pointing to that I now know I'm potentially misinterpreting.
Benn Stancil
0:24:48
Yeah, I have a somewhat negative view of documentation on these things. I think it's a noble goal, but it's really hard to maintain. If you have manually created documentation, like data dictionaries where people write down, you know, this column means that, it's often really hard to keep it up to date. People are adding data sources so quickly, and data is evolving so quickly, that it will often lag. And in a lot of cases, a data dictionary that's out of date is more dangerous than no data dictionary at all, because people will go to it, oh, here's the data dictionary, I assume this is correct, and it's actually a month behind where it should be, and so people are confidently making decisions off of out-of-date information rather than looking at it with a little bit of skepticism. This is actually, in my mind, a problem that hasn't really been solved. I know there are some vendors out there that are attempting to do this, and I'm blanking on the name, but there's a company that is attempting to do this by using the patterns of how people are actually using data to document it, sort of automatic documentation that happens in the wake of usage rather than people having to manually create it. Whether or not that technology works remains to be seen, but I think that is the right way to think about documentation, where documentation is really
0:26:03
a
0:26:04
product of the way that people use something. And really, when I have joined teams, or had new folks join teams, the best way that they learn how different pieces of the data schema work, or how things are connected together, is often from seeing how problems have been answered by other people and mirroring that. It's documentation based on actual usage, and documentation centered around the ways that people define concepts, rather than documentation based on some giant Excel file that says this column is of this type with this data. There are a few folks I've seen who have pulled that off, but for the most part it just becomes a huge time sink to invest in, and something that almost always ends up lagging. So that is a tricky problem. It's something that over time folks may figure out, but it's definitely one of those things that, for now, almost has to be a little bit of we learn by doing rather than we learn by reading a manual to know exactly what these things are. There probably are some places where you could include a common-pitfalls type of thing, like, don't use this updated_at timestamp, or this thing says it's month-to-month but in reality it's not, don't trust it, those sorts of little gotcha things. But broader documentation is something that we haven't seen anybody implement terribly well to this point.
Tobias Macey
0:27:22
I definitely agree that the idea of the static documentation, as far as this is where this comes from and this is how you use it, is grounds for a lot of potential error cases because, as you were saying, it becomes stale and out of date and is no longer a representation of reality. I was actually thinking more along the lines of the work that the folks at Lyft are doing with Amundsen, using it for data discovery and having a relevance ranking as far as how often something is being used, or the work that WeWork is doing with Marquez, where they're integrating a metadata system into their ETL pipeline so that it will automatically update the information about when a table was last loaded from source, and what the actual steps were that a given piece of data took from the source to the destination. You could look up the table in the data warehouse and then see what jobs brought it there, when they were last run, and whether there were any errors, to get better context, from the end user perspective, as far as what path this data took, so that I can have a better understanding about where it came from and how I might actually be able to use it effectively.
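The load metadata Tobias describes doesn't have to start with adopting Amundsen or Marquez wholesale; even a small bookkeeping table that every pipeline run writes to gives analysts somewhere to check when, from where, and how successfully a table was last loaded. A generic sketch, not the Marquez or Amundsen API, using SQLite purely for illustration; all names are hypothetical.

    # Generic sketch of per-run load metadata; not tied to any particular catalog tool.
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("pipeline_metadata.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS load_runs (
            table_name TEXT, source TEXT, job_name TEXT,
            loaded_at TEXT, row_count INTEGER, status TEXT, error TEXT
        )
        """
    )

    def record_load(table_name, source, job_name, row_count, status, error=None):
        """Called by a pipeline job after each run so analysts can see where a table
        came from, which job loaded it, when, and whether the run succeeded."""
        conn.execute(
            "INSERT INTO load_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
            (table_name, source, job_name,
             datetime.now(timezone.utc).isoformat(), row_count, status, error),
        )
        conn.commit()

    record_load("analytics.orders", "app_db.orders", "load_orders_daily", 125000, "success")

    # An analyst (or a BI tool) can then answer "when was this last refreshed, and did it work?"
    latest = conn.execute(
        "SELECT job_name, loaded_at, status FROM load_runs "
        "WHERE table_name = 'analytics.orders' ORDER BY loaded_at DESC LIMIT 1"
    ).fetchone()
    print(latest)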
Benn Stancil
0:28:29
Yeah, I think those are super interesting projects. And recently a company called Elementl released an open source tool called Dagster that follows in that same pattern of trying to make pipelines that you're able to parse your way through a little better and diagnose, oh, this thing went from step one to step two to step three. I think that stuff can be super interesting for analysts and data scientists, because one of the big missing pieces in data stacks, and it may be solved by these, maybe not, is that if I'm working on a question, or I get asked a question and start investigating some data and something looks awry, like something doesn't quite look right, there's always a little bit in the back of my mind that makes me think, I wonder if this is a data problem. And you're never able to quite escape that. Part of the reason I think that's true is that pipelines are notoriously fragile. You're always going to miss some data; there are always little things where you have to go through this process of, this result doesn't quite make sense, I wonder if I'm double counting something because this one-to-one mapping that I thought was in place actually isn't, because something got written in a way where we thought we had one Salesforce opportunity per customer, but it turns out a second one got created somehow. And you have to go down this rabbit hole of checking your data in various ways. It's not just, was it loaded properly, but all of these other unit-test types of things, and I don't necessarily know how you avoid that. There are probably technologies to be built that help a lot with it. But there's nothing really in place that gives an analyst or data scientist, once they look at something, confidence that, yes, this is something that I understand, I know exactly what it is, and I need to investigate the business part of the problem that this data is telling me about, rather than, well, should I check and make sure everything is working before I go too far down trying to understand why the thing happened that I think may or may not have happened. And so any step that moves in that direction, whether it's Amundsen, whether it's the thing that the folks at WeWork are building, whether it's the sort of unit-test type of tools that dbt has built, all of those things provide a little bit more confidence: I can now check off some of the things on the list that I was going to have to go verify to make sure things are right. The faster you can get through that, the faster, as an analyst, I can focus on solving the business problem, rather than banging my head against, is it a data problem? What do I not know before I take this to an exec and get them excited: oh my god, look, our revenue's doing great this quarter. Because the last thing you want to have to come back with is, well, you actually had a data problem, and that thing I told you wasn't true, because I failed to investigate this one pipeline that did a thing I didn't expect.
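The "one Salesforce opportunity per customer" assumption Benn mentions is exactly the kind of thing a lightweight assertion can catch before it silently double-counts something. A minimal sketch of that check as plain SQL run from Python; the table and column names are hypothetical, and dbt's built-in unique test expresses the same idea declaratively.

    # Sketch: assert an assumed one-to-one mapping before analyses rely on it.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE opportunities (customer_id INTEGER, opportunity_id INTEGER)")
    conn.executemany(
        "INSERT INTO opportunities VALUES (?, ?)",
        [(1, 10), (2, 20), (2, 21)],  # customer 2 accidentally has two opportunities
    )

    # Any customer with more than one opportunity breaks the assumption that joins
    # on customer_id are safe from double counting.
    violations = conn.execute(
        """
        SELECT customer_id, COUNT(*) AS n
        FROM opportunities
        GROUP BY customer_id
        HAVING COUNT(*) > 1
        """
    ).fetchall()

    if violations:
        raise AssertionError(f"Expected one opportunity per customer, found extras: {violations}")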
Tobias Macey
0:31:12
That's actually a great point, as far as the relationship between data engineers and data analysts: one of the ways that data engineers can help make the analyst's job easier is actually making sure that they are integrating those quality checks and unit tests, having an effective way of exposing the output of them, and incorporating the analyst into the process of designing those quality checks to make sure that they are asserting the things that you want them to assert. So in the context of the semantics of distributed systems, there's the concept of exactly-once delivery, or at-least-once delivery, or at-most-once delivery, and understanding how that might contribute to duplication of data or missing data, and what the actual requirements of the analyst are as far as how those semantics should be incorporated into your pipeline and what you should be shooting for. And how are you going to create assertions, whether it's using something like dbt, as you mentioned, or the Great Expectations project, or some of the expectations capabilities that are being built into things like Delta Lake, and then having some sort of dashboard or integration into the metadata system, or a way of showing the analyst, at the time that they're trying to execute a query against a data source, these are the checks that ran, these are any failures that might have happened, so that you can then take that back and say, I'm not even going to bother wasting my time on this, because I need to go back to the data engineer and tell them, this is what needs to be fixed before I can actually do my work.
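As one concrete version of the checks described here, the Great Expectations project lets a pipeline assert properties of a dataset and report which expectations failed before analysts ever query it. A minimal sketch assuming the classic pandas-backed API (great_expectations.from_pandas); the dataset and column names are hypothetical, and the exact shape of the result object varies a bit between versions.

    # Sketch of pipeline-side checks with Great Expectations' pandas API.
    import great_expectations as ge
    import pandas as pd

    orders = ge.from_pandas(pd.DataFrame({
        "order_id": [1, 2, 3, 3],            # a duplicate sneaks in upstream
        "amount":   [25.0, 40.0, None, 15.0],
    }))

    # Guard against the duplication and missing data that at-least-once or
    # at-most-once delivery can introduce.
    checks = {
        "order_id is unique": orders.expect_column_values_to_be_unique("order_id"),
        "amount is not null": orders.expect_column_values_to_not_be_null("amount"),
    }

    # Surface failures somewhere analysts can see them before they query the data.
    for name, result in checks.items():
        # Older releases return a plain dict, newer ones a result object with a
        # `success` field; handle either for this sketch.
        success = result["success"] if isinstance(result, dict) else result.success
        if not success:
            print("Data quality check failed:", name)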
Benn Stancil
0:32:32
I very much agree with that. And I think there's a lot of time that gets sunk into that. You know, there's the common line that data scientists spend 80% of their time cleaning data, and I think that number obviously varies a lot. For folks who are using machine generated data, if you're using data that's event logs and things like that, you don't spend that much time cleaning data; machine generated data is not particularly dirty in the sense of, I have to clean up a bunch of political donation files that are all manually entered. That's dirty data, where you have to figure out a way to take these 15 addresses that are all supposed to be the same and turn them into one. Machine generated data doesn't have that problem, but it has a problem of, can I rely on it? Did it fire these events properly? Did it fire them 10 times when it was supposed to fire them once? Did it miss a bunch?
0:33:17
And so anything, I
0:33:18
think, that can help with that cleaning problem of understanding exactly what I'm looking at, and how much my data represents the truth, is something that you always want as an analyst. You always have this in the back of your mind, that this isn't quite truth; this is the best we have, and in a lot of cases I think it's close, but I always have to be a little skeptical that it is truth. And so the pieces there that can help are a big value. Another thing that data engineers can actually do for analysts and data scientists is that engineers have solved a lot of these problems, or have thought about solutions for them, in ways that folks who come up through other channels into being an analyst or data scientist haven't. One of the interesting things about data scientist roles is that people come from all different backgrounds. You'll be working with someone who's a former banker, who's super deep in Excel but is just learning SQL. You'll be working with someone who's a PhD who's written a bunch of R packages but has never actually used a production warehouse. You'll be working with someone who's a former front end engineer who got into data visualization, or with an operations specialist who's been writing Oracle scripts for 10 years. There's no consistent skill set for where these folks come from. And so the idea of writing something that amounts to a unit test, or the concept of version control with git, those sorts of concepts aren't things that are necessarily going to come naturally to folks in analytics and data science. I think there are places where data engineering can push some of those best practices onto the way that analysts and data scientists work. Data engineers can do it, vendors can do it, there are lots of different ways that we can standardize some of that stuff. But those sorts of practices can definitely come from the engineering side of this ecosystem, to really encourage folks to learn those things, or push in that direction to learn some of the pieces there that are valuable for their jobs.
Tobias Macey
0:35:14
Another element of the reliability and consistency aspect, from the analyst's perspective when working with the data, is actually understanding when the process of getting that data has changed. So you mentioned things like version control. If I push a release to my processing pipeline that actually changes the way that I am processing some piece of data, and then you run your analysis and all of a sudden you've got a major drop in a trend line and you have no idea why, you're going to be spending your time tearing your hair out over, maybe I did my query wrong, or maybe something else happened. So just being able to synchronize on those changes, and making sure that everybody has buy-in as to how data is being interpreted and processed, means that you have the confidence that when you run the same query you're going to get the same results today, tomorrow, and the next day. When that expectation is not met, you start to lose trust in the data that you're working with, and you start to lose trust in the team that is providing you with that data. So making sure that there is alignment and effective communication among teams any time those types of changes come into play, I think, is another way that data engineers can provide value to the analysts.
Benn Stancil
0:36:22
Yeah, and as a very concrete example of this, that trust, I also think, extends further. As a data engineer, again, if your goal is to build an organization that's focused on the products that you're building, or that has the mentality that the products you're building matter, you also need to think about that. So as a concrete example: say you have a revenue dashboard. We've worked with a lot of companies on designing and figuring out how much they make, and it seems like it'd be the simplest thing, it's one of the most important numbers a company has, but it's always impossible; nobody does this well. There are always a ton of weird edge cases, with data coming from tons of different places: it's coming from one system, it's coming from Salesforce, which has all this manually entered stuff. So it's this kind of nightmare of a process to just figure out, how much money did we make. But say you have that revenue dashboard, and it says today that you made a million dollars last quarter, and then tomorrow it says you made one and a half million dollars last quarter. As an analyst, that's your nightmare scenario, because now an executive saw this, they don't know which number is right, and they're mad at you, because why in the world is there this big change; they just told the board it was a million dollars, and now it's a million and a half, and is it going to go to 750 tomorrow, and nobody knows what happened. And so you end up having to dig through so many different pieces of that. Did you write a bad query? Did a sales rep go into Salesforce and backdate a deal that actually signed today? Was there a data entry problem in Salesforce where somebody put in something wrong? Did you double count something that you didn't mean to double count? Was there a data pipeline problem where data got updated in a bad way? All of those things end up becoming part of, how do I figure out what happened, and often you don't have any real record of what the system was before; you just know it used to say a million and now it says one and a half, and you're like, I have to figure this out. Those are the types of problems that are the headaches for data analysts, and the ones that you end up finding yourself in all the time. The more systems you can have in place that let you say, oh yeah, it's one and a half because we backdated a deal, or because this other thing happened, the faster you can answer that, the easier your job is, but, more importantly, the more trust the rest of the organization will have in your work, because you're not spending all this time trying to explain a number and not being sure which one you actually want to stand behind.
Tobias Macey
0:38:32
From the perspective of somebody who is working so closely with data analysts, and with companies who have data engineering teams as well as ones consuming these managed services for data platforms, I'm curious what you have seen in terms of industry trends that you're most excited by from your perspective as an analyst, and some of the things that you are currently concerned by that you would like to see addressed.
Benn Stancil
0:38:59
So, and this isn't really a technology, but
0:39:04
I think it's one of the places where the industry can go. Generally, the industry has made very big strides in enabling day-to-day data-driven decision making; we've done a lot about, how do we get data in front of people around the business, how do we get it to them quickly. When I'm making a decision as a sales rep, who do I call today? Or as a marketing manager, which campaigns do I focus on? How do I make that decision? And I think we've made a lot of progress on that front. One of the places where I think we as an industry can start to turn and focus more is that businesses aren't driven by these daily optimizations, and businesses don't win because they made the right daily optimizations. It certainly helps, but the big bets are often the things that determine the winners. Jeff Bezos has this line that in business you want to swing for the fences, because unlike in baseball, where the best you can do on a single swing is score four runs, in business, if you take the right swing, you can score a thousand runs. There's such a long tail of positive outcomes from decisions that it's worth it to take some big bets, because the outcomes of those can be way better than any kind of small optimization. And I think we haven't really had data technologies focused on helping people figure out how to make those big bets. There's a lot of exploratory work that analysts do to really try to understand, what is the big bet that we should make? That's one of the places that I'm excited for folks to be able to go next: not just, all right, we are data driven because we have dashboards, we are data driven because everybody's able to look at a chart every day, but how do we become data driven about the big bets? And that, I think, is really, how do we enable data analysts and data scientists to answer questions more quickly, to be able to explore which big bets might work out. Ultimately, the way that I think you win on making big bets is by being able to make more of them and make them smarter, and the way I think you can do that is basically to more quickly research these problems and understand what might happen if you make these changes. Those aren't things that a dashboard will tell you; those are the things an in-depth analysis will tell you. And so as an industry, I think that's one of the places we can go next: not just enabling, how do we optimize small things, but how do we uncover the really big opportunities
0:41:27
that are still very much kind of a boardroom type
0:41:29
of conversation these days?
Tobias Macey
0:41:31
And are there any other aspects of the ways that data engineers and data analysts can work together effectively that we didn't discuss yet that you'd like to cover before we close out the show?
Benn Stancil
0:41:42
I think it mostly comes down to knowing your customer and knowing the problems they're trying to solve — really getting to know exactly what it is they're trying to do and how they use the products that you build. Again, I think learning this from engineers — or not from engineers, but from product managers and designers —
0:42:04
is a super valuable thing.
0:42:07
Because that's the problem they solve: learning from their customers. And I think data engineers can in a lot of ways do the same thing. One other thing — and this isn't really about engineers themselves, but a place where organizations can think a little differently about data engineering today — is: don't hire data engineers too early. Tristan Handy, I think, wrote a blog post about this, basically making the point that a lot of organizations think they need a data engineer before they actually do. With all the tools out there now — the ETL tools, the point-and-click warehouses that are super easy to set up, and how well these tools scale — Stitch and Fivetran and those sorts of tools can handle pipelines that are plenty big for most companies, and Snowflake and Redshift can scale to databases that are plenty big for most folks. Your data engineering problems are often not going to be that interesting or complex. They're problems that are important for someone to own, but that someone often doesn't need to be a dedicated data engineer. And I think there is a way in which companies hire a data engineer too quickly, because they think they need the role, and the data engineer ends up basically being an administrator of a bunch of third-party tools. That's not a role a data engineer wants; an engineer wants to solve hard problems and work on interesting stuff. They don't want to be someone who's, you know, checking to make sure that Stitch is running every day or that your Airflow
Unknown
0:43:32
pipelines ran — you know, checking the
Benn Stancil
0:43:34
dashboard: yep, Airflow ran again, looks good. That's not that interesting of a problem. And I think it's about being honest about what the role is and when you need people to come in, before you actually hire someone who thinks they're coming in to figure out how to scale a Spark cluster to something huge, when in reality they're just checking things like your Snowflake credits every day and confirming, okay, we're still using them at the rate we expected. So I think that's kind of a big shift: you can get pretty far with the out-of-the-box stuff now.
Tobias Macey
0:44:02
So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Benn Stancil
0:44:18
So I think one is the piece that we've talked about: how do you monitor pipelines — and not monitoring in a
0:44:25
strict sort of DevOps sense,
0:44:27
but monitoring in the sense of knowing, again, when I have a question and I see something out of place, whether it was because of a change I made, because some assumption I made got invalidated, because a data pipeline didn't work, or because a pipeline ran in an order that was unexpected — and being able to very quickly track that down. All of those sorts of things are super valuable and save analysts tons of time they would otherwise spend digging through the weeds of these problems. There's another place where I think we're starting to see some movement, but where we still don't have a really solid answer, which is a centralized modeling layer,
0:45:03
essentially, that
0:45:04
when you think about how data gets used around an organization, it's not consumed by one application. As a business, I have a bunch of data; typically folks can now centralize that data into data lakes or warehouses — putting it all into S3 and putting Athena on top, or Snowflake, or whatever — but then you have to consume it. Say you want to model what a customer is. That's a traditional BI type of problem, but most BI models only operate within the BI application. And because data is now spread so widely through an organization, the model of what a customer is needs to be centralized. It needs to be available to engineers who are using APIs to pull data out programmatically to define something in the product; it needs to be available to data scientists who are building models on top of it to forecast revenue or build in-product recommendation systems; and it needs to be available to an executive who's looking at a BI tool to understand how many new customers we got each day. All of these different applications require a centralized definition of what a customer is. A tool like dbt is moving in the right direction, but there's still not a great way to unify concepts within a data warehouse like that in a way that stays consistent. So as someone consuming that data, you end up having to rebuild a lot of these concepts in different ways, which creates all sorts of problems where what a customer is in one place isn't quite what a customer is in another. I don't think we've quite figured out that part of the stack. We have the warehouse layer, and there are very robust applications that sit on top of the warehouse, but they all feed off the warehouse through different channels and through different business definitions of what the data means. Without that centralized layer, you're always going to have some confusion over these different definitions.
Tobias Macey
0:46:48
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Mode and your experiences of working with data engineers and trying to help bridge the divide between the engineers and the analysts. It's definitely a very useful conversation and something that everybody on data teams should be thinking about to make sure that they're providing good value to the business. So I appreciate your time, and I hope you enjoy the rest of your day.
Unknown
0:47:13
Thanks, guys. Thanks for having me.
Tobias Macey
0:47:20
Thanks for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

A High Performance Platform For The Full Big Data Lifecycle - Episode 94

Summary

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
    • What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
  • Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
  • Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
  • For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
    • How does HPCC Systems manage persistence and scalability?
  • What are the integration points available for extending and enhancing the HPCC Systems platform?
  • What is involved in deploying and managing a production installation of HPCC Systems?
  • The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
  • How does the Thor engine manage data transformation and manipulation?
    • What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
  • For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
  • How are you using the HPCC Systems platform in your work at LexisNexis?
  • Despite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
  • How is the HPCC Systems project governed, and what is your approach to sustainability?
    • What are some of the additional capabilities that are only available in the enterprise distribution?
  • When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
  • What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
  • What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
  • What do you have planned for the future of HPCC Systems?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:14
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI conference, the Strata Data conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Your host is Tobias Macey, and today I'm interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions. So Flavio, can you start by introducing yourself?
Flavio Villanustre
0:01:36
Of course, Tobias. My name is Flavio Villanustre. I'm Vice President of Technology and CISO for LexisNexis Risk Solutions. At LexisNexis Risk Solutions we have a data platform called the HPCC Systems platform, which we made open source in 2011. Since then I have also been, as part of my role, involved with leading the open source community initiative: ensuring that the open source community truly leverages the platform and helps contribute to it, and acting as a liaison between the LexisNexis Risk Solutions organization and the rest of the larger open source community.
Tobias Macey
0:02:19
And do you remember how you first got involved in the area of data management?
Flavio Villanustre
0:02:22
It has been a gradual progression — probably starting in the early 90s, going from databases to database analytics to data management to data integration. Keep in mind that even within LexisNexis we started the HPCC Systems platform back before the year 2000, so back then we already had data management challenges with traditional platforms, and that's how we started with this. I've been involved with HPCC since I joined the company in 2002, but I had been in data management for a very long time before that.
Tobias Macey
0:02:57
And so for the HPCC system itself, can you talk through some of the problems that it was designed to solve and some of the issues that you were facing at LexisNexis that led to its original creation?
Flavio Villanustre
0:03:10
Oh, absolutely. At LexisNexis Risk Solutions we started with risk management as, I'd say, our core competency back in the mid 90s. And as you get into risk management, one of the core assets when you are trying to assess risk and predict outcomes is data. Even before people spoke about big data, we had a significant amount of data — mostly structured, some semi-structured too, but the vast majority structured. And we used to use the traditional platforms out there, whatever we could get our hands on. Again, this is back in the day before Hadoop, and before MapReduce was applied as a distributed paradigm for data management or anything like that. So databases — Sybase, Oracle, Microsoft SQL Server, whatever there was — and data management platforms like Ab Initio or Informatica, whatever was available at the time. The biggest problem we had was twofold. The first was scalability: all of those solutions typically run on a single system, so there is a limit to how much bigger you can go vertically, and that limit is much lower still if you also consider the cost and affordability of the system — there is a point where you go beyond what a commodity system is and you start paying a premium price for whatever it is. So that was the first piece. One of the attempts at solving this problem was to split the data and use different systems, but splitting the data also creates challenges around data integration. If you're trying to link data, surely you can take the traditional approach, which is to segment your data into tables, put those tables in different databases, and then use some sort of foreign key to join the data. But that's all good and dandy as long as you have a foreign key that is unique and reliable, and that's not the case with data that you acquire from the outside. If you generate the data yourself you can have that; if you bring the data in from the outside, you might have a record that says this record is about John Smith, and another record that says this record is about Mr. John Smith, but do you know for sure that those two records are about the same John Smith? That's a linking problem, and the only way you can do linking effectively is to put all the data together. So now we had this tension: in order to scale, we needed to segment the data, but in order to do what we needed to do, we needed to put the data in the same data lake, as it's called today — as a team we used to call it a data land, and we eventually switched terms in the late 2000s because "data lake" became more well known. At that point the potential paths to overcome the challenge were: either we split all of the data as before, and then come up with some sort of meta-system that leverages all of these separate data stores — and when you're doing probabilistic linkage you have problems whose computational complexity is N squared or worse, so you pay a significant price in performance, though potentially it can be done if you have enough time, your systems are big enough, and you have enough bandwidth between the systems. But the complexity you take on from a programming standpoint is also quite significant. And
0:06:33
sometimes you don't
0:06:34
have enough time: sometimes you get data updates that come maybe hourly or daily, and doing this big linking process might take weeks or months if you're doing it across different systems. The complexity of programming it is also a pretty significant factor to consider. So at that point we thought a better approach was to create a platform — to define an underlying platform for applying these kinds of solutions, with algorithms in a divide-and-conquer type of approach. We would have something that would partition the data automatically and distribute those partitions onto different commodity computers, and then we would add an abstraction layer on top that would create a programming interface giving you the appearance that you are dealing with a single system and a single data store, and whatever you coded for that data store would be automatically distributed to the underlying partitions. Also, because hardware was far slower than it is today, we thought it would be a good idea to move as much of the algorithm as we could to those partitions rather than executing it centrally. So instead of bringing all of the data to a single place to process it — a single place that might not have enough capacity — we would do as much as we could, for example a pre-filtering operation, a distributed grouping operation, or a filtering operation, on each one of the partitions, and eventually, when you need to do the global aggregation, you do it centrally, but now on a far smaller data set that has already been pre-filtered. Then the time came to define how to build the abstraction layer. The one thing everyone knew about was SQL as a programming language, and we said, well, this must be something we can tackle with SQL as a programming interface for our data analysts. But the people who worked with us were quite used to a dataflow model because of the type of tools they were using before — things like Ab Initio — where the data flows are diagrams in which the nodes are the operations, the activities you perform, and the lines connecting the activities represent the data traversing them. So we thought a better approach than SQL would be to create a language that gave you the ability to build those sorts of data flows on the system. That's how ECL was born, which is the language that runs on HPCC.
Tobias Macey
0:09:05
So it's interesting that you had all of these very forward-looking ideas in terms of how to approach data management, well in advance of when the overall industry started to encounter the same types of problems as far as the size and scope of the data they were dealing with — which led to the rise of the Hadoop ecosystem and the overall ideas around data lakes, MapReduce, and some of the new data management paradigms that have come up. I'm wondering what the overall landscape looked like in the early days of building the HPCC system that required you to implement this in house, and some of the other systems or ideas that you drew on for inspiration for these approaches to data management and the overall systems architecture of HPCC.
Flavio Villanustre
0:09:52
That is a great question. It's interesting, because in the early days, when we told people what we were doing, they would look at us and often ask, "Well, why don't you use database XYZ or data management system XYZ?" And the reality is that none of those would have been able to cope with the type of data and the frequency of data processing, they wouldn't offer the flexibility of processing — like the probabilistic record linkage I explained before — and they certainly didn't offer a seamless transition between data management and data querying, which was also one of the important requirements we had at the time. It was quite difficult to explain to others why we were doing this and what we were gaining by doing it. Map and reduce operations, as functional programming operations, have been around for a very long time, since the Lisp days in the 50s, but the idea of using map and reduce as operations for data management didn't get published until, I think, December 2004. I remember reading the original paper from the Google researchers thinking, well, now someone else has the same problem and they got to do something about it — even though at the time we already had HPCC and we already had ECL, so it was perhaps too late to go back and re-implement the data management aspects and the programming abstraction layer of HPCC. Just for those in the audience who don't know much about ECL — and again, this is all open source and free, Apache 2.0 licensed, with no strings attached, so please go and look at it — in summary, ECL is a declarative dataflow programming language. It is declarative in the manner of what you find in SQL or in functional programming languages — Haskell, say, or Lisp and Clojure and others — but from the dataflow standpoint it is closer to something like TensorFlow, if you are familiar with TensorFlow as a deep learning programming paradigm and framework. You code data operations that are primitives — like, for example, sort: you can say "sort this data set by this column in this order" and then add more modifiers if you want; you can do a join across data sets, and again the operation is named JOIN; you can do a rollup operation, and the operation is named ROLLUP. All of these are high-level operations that you define in your program. And in a declarative programming language, you create definitions rather than assign variables. For those who are not familiar with declarative programming — and surely many are in this audience — declarative programming has, for the most part, the property of immutable data structures, which doesn't mean you cannot do valuable work; you can do all of the same work, or better, but it gets rid of side effects and other pretty bad issues you have with more traditional mutable data structures. So you define things: I have a data set that is a phone book, and I define an attribute that is this data set filtered by a particular value, and then I might define another attribute that takes the filtered data set and groups it in a particular way. At the end of the day, any ECL program is just a set of definitions that are handled by the compiler.
The ECL compiler translates ECL into C++, which then goes through the C++ compiler of the system — whatever toolchain you have — and generates machine code, and that is the code that runs on the platform. But the fact that ECL is such a high-level programming language, and the fact that it is declarative, means the ECL compiler can make decisions that a more imperative programming language wouldn't allow a compiler to make. The compiler in a declarative, functional language also in a sense knows the ultimate goal of the program, because the program is, in some ways, isomorphic to an equation — from a functional standpoint you could even inline every one of your statements into a single massive statement. So the compiler can, for example, apply non-strictness: if you made a definition that is never going to be used, there is no point in compiling it in or executing it at all, and that saves execution time. If you have a conditional fork somewhere in your code, but that condition is always met or never met, there is no need to compile the other branch. All of this has performance implications that can be far more significant when you're dealing with big data. One particular optimization is around combining operations: if you are going to apply several operations to every record in a data set, it is far more efficient and a lot faster to combine all of those operations and do only one pass over the data with all of the operations, if that's possible — and the ECL compiler does exactly that. It takes away perhaps a little bit of flexibility from the programmer by being far more intelligent at compile time; of course, the programmer can tell the compiler "I know better" and force it to do something that might otherwise seem unreasonable. But just as an example: you could say, "I want to sort this data set, and then I want to filter it and get only these few records." If you say it in that order, an imperative programming language would first sort — and sorting, even in the most optimal case, is an N log N type of operation in computational complexity — and then filter and get only a few records out of it, when the optimal approach is to filter first, get those few records, and then sort them. The ECL compiler does exactly that.
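As a small, hedged illustration of the declarative style described above — the dataset name, logical file path, and fields here are made up for the example, not taken from LexisNexis code — a set of ECL definitions might look like this. The programmer states the filter and the sort as definitions, and the compiler is free to reorder them, such as pushing the filter ahead of the sort:

```
// Each statement is a definition, not an assignment; nothing executes
// until an action such as OUTPUT uses it.
PersonRec := RECORD
  STRING20 lastname;
  STRING20 firstname;
  STRING2  state;
END;

persons := DATASET('~example::persons', PersonRec, THOR);

// Definition: persons filtered to one state
floridians := persons(state = 'FL');

// Definition: the filtered set, sorted by name
sortedFloridians := SORT(floridians, lastname, firstname);

// Only this action causes work; the compiler sees the whole graph and can
// optimize it (e.g. drop unused definitions, combine passes over the data).
OUTPUT(sortedFloridians);
```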
Tobias Macey
0:16:01
The fact that the language that you're using for defining these manipulations ends up being compiled — and I know that both the ECL language itself and the overall HPCC platform are implemented in C and C++ — is definitely a great opportunity for better performance characteristics, and I know that in the comparisons you have available between HPCC and Hadoop, that's one of the things that gets called out. As far as the overall workflow for somebody who is interacting with the system using the ECL language, I'm curious whether the compilation step ends up being in any way — not a hindrance, but a delaying factor — as far as being able to do some experimental iteration, or whether there is the capability of doing some level of interactive analysis or interaction with the data, for being able to determine the appropriate set of statements to get the desired end result when you're building an ECL data flow.
Flavio Villanustre
0:17:05
That's another great question — I can see that you're quite versed
0:17:10
in programming. So you're right, the fact that ECL is compiled means — again, for the rest of the audience — that we have an integrated development environment, the ECL IDE, and of course we support others like Eclipse and Visual Studio and the standard ones, but I'll talk about the ECL IDE because it's what I mostly use. In that case, you write the ECL code, and you can certainly syntax-check it, which verifies that the code is syntactically correct. But at some point you want to run the code, because you want to know whether it makes sense semantically and gives you the right results. Running the code means going through the compilation process, and depending on how large your code base is, that can take longer. Now, the compiler does know what has been modified. Remember, ECL is a declarative programming language, so if you haven't touched a number of attributes — and again, data structures are immutable and there are no side effects — the attributes you didn't change should compile to exactly the same thing. When you define a function, that function has referential transparency: if you call it at any time, it will give you the same result based only on the parameters you pass. With that, the compiler can take some shortcuts: if you are recompiling a bunch of ECL attributes but you haven't changed many of them, it will just reuse the precompiled code for those and only compile the ones you have changed. So the compilation process, when you are iteratively working on code, tends to be fairly quick — maybe a few seconds — though of course you depend on having an ECL compiler available. Traditionally we had a centralized approach to the ECL compiler, with one or a few of them running in the system. We have moved to a more distributed model where, when you deploy the ECL IDE and the ECL tools on your workstation, a compiler goes with them, so compilation can happen on the workstation as well, and it's available to you at all times. One of the bottlenecks before was that you were trying to take this quick, iterative approach while the compiler was being used by someone compiling a massive amount of ECL for some completely new job, which might take minutes, and you were sitting there waiting for the compiler to finish that other compile. By the way, the time to compile is an extremely important consideration, and we continue to improve the compiler to make it faster — we have learned a lot over the years, as you can imagine. Some of the same core developers who built the ECL compiler — Gavin Halliday, for example — have been with us since the very beginning; he was one of the core architects behind the initial design of the platform, and he's still the lead architect developing the ECL compiler. That means a lot of the knowledge that has gone into the compilation process and its optimization keeps making it better and better. Of course, with a larger community working on the compiler, more people involved, and more documentation around it, others can pick up where he leaves off.
But hopefully he will be around and doing this for a long time. Making sure that the compiler is as just-in-time as it can be is very important. There is no interpreter for ECL at this point, and I think it would be quite difficult to make it completely interactive — to the point where you submit just a line of code and it does something — because of the way a declarative programming paradigm works.
Tobias Macey
0:21:17
And also, because you're working, most likely, with large volumes of data distributed across multiple nodes, being able to do REPL-driven development is not really very practical, or it doesn't really make a lot of sense. But the fact that there is this fast compilation step, and the ability to have near real-time interactivity in seeing what the output of your program is, is good to see — particularly in the big data landscape, where I know that the overall MapReduce paradigm was plagued in the early years by how time-consuming it was to submit a job and see the output before you could take that feedback and wrap it into your next attempt. That's why there have been so many different layers added on top of the Hadoop platform, in the form of Pig and Hive and various SQL interfaces, to get a more real-time, interactive, and iterative development cycle.
Flavio Villanustre
0:22:14
Yeah, you're absolutely right there. Now, one thing that I haven't told the audience yet is what the platform looks like inside, and I think we're getting to the point where it's important to explain that. There are two main components in the HPCC Systems platform. There is one component that does data integration — this massive data management engine, the equivalent of your data lake management system — which is called Thor. Thor is meant to run one ECL work unit at a time, and that work unit can consist of a large number of operations, many of them running in parallel, of course. Then there is another component known as Roxie, which is the data delivery engine, and one which is sort of a hybrid, called hThor. Roxie and hThor are both designed to handle tens of thousands or more operations at the same time, simultaneously; Thor is meant to do one work unit at a time. So when you are developing on Thor, even though your compilation might be quick, and you might run on a small data set quickly — because you can execute the work unit on a small data set using, for example, hThor — if you are trying to do a large transformation of a large data set on your Thor system, you still need to go into the queue on that Thor, and you will get your turn whenever it's due. Sure, we have priorities, so you can jump into a higher-priority queue, and maybe you can be queued right after the current job but before any other future jobs. We also partition jobs into smaller units that are fairly independent from each other, so we can even interleave some of your jobs in between a job that is running, by slotting them between those segments of the work unit. But interactivity there is a little less than optimal. That is the nature of the beast, because you want a large system to be able to churn through all the data in a relatively fast manner, and if we tried to truly multi-process, many of the resources would likely suffer, so you could end up paying a significant overhead across all of the processes running in parallel. Now, I did say that Thor runs only one work unit at a time, but that was a little bit of a lie — it was true a few years ago. Today you can define multiple queues in a Thor and run three or four work units, but certainly not thousands of them. That's a big difference from Roxie. Can you run your work unit on Roxie, or on hThor? Yes, and it will run concurrently with anything else that is running, with almost no limit — thousands and thousands of them can run at the same time. But there are other considerations about when you run things on Roxie or hThor versus on Thor, so it might not be what you really want.
Tobias Macey
0:25:29
Taking that a bit further, can you talk through a standard workflow for somebody who has a data problem they're trying to solve, and the overall lifecycle of the information as it starts from the source system, gets loaded into the storage layer of the HPCC platform, has an ECL job defined for it that then gets executed in Thor or hThor, and then gets queried out the other end from Roxie — and the overall ways the systems interact with each other throughout that data lifecycle?
Flavio Villanustre
0:26:01
Oh, I'd love to. Let's set up something very simple as an example. You have a number of data sets coming from the outside, and you need to load those data sets into HPCC. The first operation that happens is something known as spray. Spray is a simple process — the name comes from the concept of spray-painting the data across the cluster. It runs on a Windows box or a Linux box, and it takes your data set — let's say it's a million records long. It can be in any format: CSV, fixed length, delimited, or whatever. It will look at your total data set, and it will look at the size of the Thor cluster where the data will be stored initially for processing. Let's say you have a million records in your data set and ten nodes in your Thor — just to make the numbers round and small. It will partition the data set into ten partitions, because you have ten nodes, and it will then copy and transfer each one of those partitions to the corresponding Thor node. If the copy can be parallelized in some way — for example, because your records are fixed length — it will automatically use pointers and parallelize it; if the data is in XML format or in a delimited format where it's hard to find the partition points, it will need to do a pass over the data, find the partition points, and eventually do the parallel copy to the Thor system. So now you end up with ten partitions of the data, with the data in no particular order other than the order it had before: the first 100,000 records go to the first node, the second 100,000 records go to the second node, and so on until you reach the end of the data set. That gives each one of the nodes a similar number of records, which tends to be a good thing for most processes. Once the data is sprayed, or
0:28:10
while the data is being sprayed — and depending on the size of the data —
0:28:13
or even before, you will most likely want to write a work unit to work on the data. I'm trying to do this example as if this is the first time you are seeing the data — otherwise all of this is automated, so you don't need to do anything manually; all of it is scheduled and automated, and a work unit that you already had would run on the new data set and append it or do whatever needs to be done. But let's imagine it's completely new. So now you write your work unit. Let's say your data set is a phone book, and you want to first deduplicate it and build some rolled-up views on the phone book, and eventually you want to allow users to run queries through a web interface to look up people in the phone book. And, just for the sake of argument, let's say you also want to join that phone book with your customer contact information. So you will write a work unit that has the join to merge those two data sets; you will have some deduplication and perhaps some sorting. After you have that, you will want to build keys — you don't strictly need to, but you will want to. There is another process, a key build, that also runs on Thor and will be part of your work unit. So essentially the developer writes the work unit in ECL and submits it; the ECL is compiled and runs on your data, and hopefully it will be syntactically correct when you submit it and will run, giving you the results you were expecting on the data. I mentioned this before, but ECL is a statically typed language as well, which means it is a little harder to have errors that only appear at runtime: between the fact that it has no side effects and that it is statically typed, most typing errors, data errors, and function operation errors are a lot less frequent. It's not like Python, where things may
0:30:17
seem okay — the
0:30:20
run may be fine — but then at some point a run gives you an error because a variable that was supposed to hold a piece of text has a number instead, or vice versa. So you run the work unit, and it gives you results; as a result of this work unit you will potentially get some statistics and metrics on the data, and it will give you a number of keys. Those keys are also partitioned in Thor: across the Thor nodes, the keys are partitioned in pieces on those nodes. You can query those keys from Thor as well — you can write a few attributes that do the querying there — but at some point you will want to write those queries for Roxie to use, and you will want to put the data in Roxie, because you don't have one user querying the data, you have a million users going to query that data, and perhaps 10,000 of them querying simultaneously. So for that process you write another piece of ECL, another sort of work unit, but we call this one a query, and you submit it to Roxie instead of Thor — there is a slightly different way to submit it: you select Roxie and submit it there. The difference between this query and the work unit you had on Thor is that the query is parameterized. Similar to a parameterized stored procedure in your database, you define some variables that are supposed to come from the front end, from the user input, and then you use the values in those variables to run whatever filters or aggregations you need. That will run in Roxie and will leverage the keys you built in Thor. As I said before, the keys are not mandatory: Roxie can perfectly well work without keys, and it even has a way to work with in-memory distributed data sets, so even if you don't have a key you don't pay a huge price for doing sequential lookups on the data and full table scans of your database. So you submit that query to Roxie. When you do, Roxie will see that the data it needs is in Thor, not in Roxie — and this is also your choice, but most likely you will just tell Roxie to load the data from Thor. It knows where to load the data from, because it knows what the keys are and what their names are, and it will automatically load those keys. It is also your choice whether Roxie starts allowing users to query the front-end interface while it's loading the data, or waits until the data is loaded before it allows queries to happen. The moment you submit the query to Roxie, Roxie automatically exposes it on the front end. There is a component called ESP; that component exposes a web services interface, which gives you a RESTful interface, a SOAP interface, JSON for the payload if you're going through the RESTful interface, and even an ODBC interface if you want — so you can even have SQL on the front end. So the moment you submit the query, it automatically generates all of these web service interfaces.
So automatically, if you want to go with a web browser on the front end, or if you have an application that can use, say, a RESTful interface over HTTP or HTTPS, you can use that and it will automatically have access to that Roxie query you submitted. Of course, a single Roxie might have not one query but a thousand different queries deployed at the same time, all of them exposing an interface, and it can have several versions of the same query as well. The queries are all exposed versioned from the front end, so you know which one your users are accessing, and if you deploy a new version of a query or modify an existing one, you don't break your users if you don't want to — you give them the ability to migrate to the new version as they want. And that's it, that's pretty much the process. Now, as I said, you will want automation, and all of this can be fully automated in ECL. You may want to have data updates, and I told you data is immutable, so every time you think you're mutating or updating data you're really creating a new data set, which is good because it gives you full provenance: you can go back to every previous version. Of course, at some point you need to delete data or you will run out of space, and that can also be automated. And if you have updates on your
0:34:36
data, we have concepts like superfiles, where you can apply updates, which are essentially new overlays on the existing data, and the existing work unit can just work on that happily as if it were a single data set. So a lot of the complexity that would otherwise be exposed to the developer is abstracted away by the system. If developers don't want to see the underlying complexity, they don't need to; if they do, they have the ability to. I mentioned before that ECL will optimize things — so if you tell it to do a join, and the join needs a sort first, it will do the sort. But if you know your data is already sorted, you might say, "let's not do the sort," or "I want to do this join on each one of the partitions locally instead of as a global join" — and the same goes for the sort operation in ECL. If you tell it to do that, and you know better than the system, ECL will follow your orders; if not, it will take the safe approach to your operation, even if it adds a little more overhead.
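As a rough sketch of the kind of Thor work unit and Roxie query walked through above, here is what the phone book example might look like in ECL. The record layouts, logical file names, query parameters, and join condition are hypothetical — this only illustrates the shape of the workflow, not LexisNexis's actual code — and in practice the Thor portion and the Roxie query would normally live in separate work units, with the query published to Roxie and exposed by ESP:

```
// --- Thor work unit: clean the sprayed phone book and join it to customers ---
PhoneRec := RECORD
  STRING20 lastname;
  STRING20 firstname;
  STRING12 phone;
END;

CustRec := RECORD
  STRING20 lastname;
  STRING20 firstname;
  STRING40 email;
END;

phonebook := DATASET('~example::in::phonebook', PhoneRec, THOR);
customers := DATASET('~example::in::customers', CustRec, THOR);

// Deduplicate the phone book (sort first so duplicates are adjacent)
dedupedPB := DEDUP(SORT(phonebook, lastname, firstname, phone),
                   lastname, firstname, phone);

// Join the phone book to customer contact info on name
CombinedRec := RECORD
  STRING20 lastname;
  STRING20 firstname;
  STRING12 phone;
  STRING40 email;
END;

CombinedRec joinEm(PhoneRec p, CustRec c) := TRANSFORM
  SELF.email := c.email;
  SELF := p;
END;

combined := JOIN(dedupedPB, customers,
                 LEFT.lastname = RIGHT.lastname AND
                 LEFT.firstname = RIGHT.firstname,
                 joinEm(LEFT, RIGHT));

// Key build: a payload index that a Roxie query can use later
ContactKey := INDEX(combined, {lastname, firstname}, {phone, email},
                    '~example::key::contacts');
BUILD(ContactKey);

// --- Roxie query: parameterized, like a stored procedure ---
// Values arrive from the ESP web-service interface via STORED.
STRING20 searchLast  := '' : STORED('lastname');
STRING20 searchFirst := '' : STORED('firstname');

OUTPUT(ContactKey(KEYED(lastname = searchLast),
                  KEYED(firstname = searchFirst)));
```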
Tobias Macey
0:35:47
A couple of things that I'm curious about out of this are the storage layer of the HPCC platform and some of the ways that you manage redundancy and durability of the data. I also noticed when looking through the documentation that there is support for taking backups of the information, which I know is non-trivial when dealing with large volumes. And on the Roxie side, I know that it maintains an index of the data, and I'm curious how that index is represented, maintained, and kept up to date in the overall lifecycle of the platform.
Flavio Villanustre
0:36:24
Those are also very good questions. In the case of Thor — and here we need to go down to a little bit of system architecture — each one of the nodes primarily handles its own chunk of data, its own partition. But there is always a buddy node, some other node that has its own partition but also holds a copy of the partition of another node. If you have ten nodes in your cluster, node number one has the first partition and may have a copy of the partition that node ten has; node number two has partition number two but also a copy of the partition that node number one has, and so on. Every node has one primary partition and one backup partition from another node. Every time you run a work unit — I said data is immutable, so you are generating a new data set every time you materialize data on the system, either by forcing it to materialize or by letting the system materialize the data when it's necessary; the system tries to stream as much as possible, similar to Spark or TensorFlow, where data can be streamed from activity to activity without being materialized — at some point the system decides it is time to materialize, because the next operation might require materialized data, or because you've been going too long with data that would be blown away if something went wrong with the system. Every time it materializes data, a lazy copy of the newly materialized data happens to those backup nodes. So if something goes very wrong and one of the nodes dies and the data on its disk is corrupted, you know that you always have another node with a copy. The moment you replace the node — you essentially pull it out and put another one in — the system automatically rebuilds the missing partition, because it has complete redundancy of all of the data partitions across the different nodes. In the case of Thor that tends to be sufficient. There is, of course, the ability to do backups, and you can back up all of these partitions, which are just files in the Linux file system, so you can back them up using any Linux backup utility, or you can have HPCC back them up for you onto any other system — you can have cold storage. One of the concerns is what happens if your data center is compromised and someone modifies or destroys the data in the live system, so you may want some sort of offline backup; you can handle that in your normal system backup configuration, or you can do it in HPCC and have it offloaded as well. For Roxie, redundancy is even more critical. In the case of Thor, when a node dies, it is sometimes less convenient to let the system keep working in a degraded way, because the system is typically as fast as the slowest node. If all nodes are doing the same amount of work, a process that takes an hour takes an hour; but if one node dies, there is now one node doing twice the work, because it has to deal with two partitions of data — its own and the backup of the other one — and the process might take two hours. So it is more convenient to just stop the process when something like that happens, replace the node, let the system rebuild that node quickly, and continue the processing.
That might take an hour and twenty minutes, or an hour and ten, rather than the two hours it would otherwise have taken. Besides, if the system continues to run and your storage died in one node because it's old, there is a chance that other storage devices under the same stress will die the same way. You want to replace that one quickly and have a copy as soon as you can, and not run the risk of losing two of the partitions. If you lose two partitions that are on different nodes and are not backups of each other, that's fine; but if you lose the primary node and the backup node for the same partition, there is a chance you lose the entire partition, which is bad — bad if you don't have a backup, and even with one you'll be restoring from backup, which takes time. So it's also inconvenient. Now, in the Roxie case, there is far more pressure to have the process continue, because your Roxie system is typically exposed to online production customers that may pay you a lot of money for you to be highly available.
0:41:06
So Roxie allows you to define the amount of redundancy that you want, based on the number of copies that you want. You could say, well, on this Roxie I need the default, one copy of the data, or I need three copies of the data — so the partition on node one will also have copies on nodes two, three, and four, and so on. Of course, you need four times the space, but you have far higher resilience if something goes very wrong, and Roxie will continue to work even if a node is down, or two are down, or as many nodes as you like, as long as the data is still there. Worst case, if you lose a partition completely, Roxie can continue to run if you want it to, but it won't be able to answer any queries that try to use that particular partition that's gone, which is sometimes not a good situation. You asked about the format of the keys, the indexes, in Roxie. It's interesting, because those keys — which are, again, typically the format of the data you have in Roxie — are for the most part multi-field keys, like in any decent database out there. Typically those fields are ordered by cardinality: the fields with the larger cardinality go at the front, to make the key perform better. It has interesting abilities — for example, you can step over a field that you don't have a value for, use a wildcard for it, and still use the remaining fields, which is not something a database normally does: usually, once you hit a field you don't have a value for, the rest of the fields to the right are useless. There are other quite interesting things there too. The way the data is stored in those keys is by decomposing each key into two components. There is a top-level component that indicates which node holds a given partition, and there is a bottom-level component that indicates where on that node's hard drive the specific data elements, or the specific block of data elements, are. By decomposing the keys into these two hierarchical levels, every node in Roxie can hold the top level, which is very small, so every node knows where to go for specific values. That means every node can be queried from the front end, so you get good scalability on the front end — you can have a load balancer and load-balance across all of the nodes — and on the back end they can still find which node to ask for a value. When I said the top level points to the specific node, I lied a little bit, because it doesn't point to node number one; it uses multicast. Nodes that have a partition of the data subscribe to a multicast channel, and what you have in the top level is the number of the multicast channel that handles that partition. That allows Roxie nodes to be more dynamic and also handles the fault-tolerance situations where nodes go down: it doesn't matter that you send the message to a multicast channel — any node that is subscribed will get the message. Which one responds? Well, whichever is faster, the node that is less burdened by other queries, for example. And if any node in the channel dies, it really doesn't matter —
You're not stuck in a TCP connection waiting for the handshake to happen, because the way it works is UDP: you send the message and you get the response. And of course, if nobody responds in a reasonable amount of time, you can resend that message as well.
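To make the two-level key scheme described above a little more concrete, here is a rough sketch in Python of the routing idea: the top level of a composite key maps a partition to a multicast channel that every node knows about, and only the nodes subscribed to that channel hold the bottom-level, on-disk locations. The names, hashing scheme, and retry logic are all invented for illustration — this is not Roxie's actual implementation.

```python
import hashlib

# Hypothetical illustration: every node holds the small top-level map
# (partition -> multicast channel); only the nodes that store a partition
# hold the bottom-level map (key -> location on their local disk).

NUM_CHANNELS = 4

def partition_of(key_fields):
    """Top-level component: which partition (and therefore which
    multicast channel) owns this composite, multi-field key."""
    joined = "|".join(key_fields)
    digest = hashlib.sha1(joined.encode()).hexdigest()
    return int(digest, 16) % NUM_CHANNELS

class Node:
    def __init__(self, name, channels):
        self.name = name
        self.channels = set(channels)  # multicast channels this node subscribes to
        self.local_index = {}          # bottom-level: key -> offset on local disk

    def maybe_answer(self, channel, key_fields):
        # Any subscribed node may answer; in practice the least-busy one wins.
        if channel in self.channels:
            return f"{self.name} answers lookup for {key_fields}"
        return None

def query(nodes, key_fields, retries=3):
    """Front end sends the request on the partition's channel (UDP-style);
    if nobody answers in time, it simply resends."""
    channel = partition_of(key_fields)
    for _ in range(retries):
        for node in nodes:
            answer = node.maybe_answer(channel, key_fields)
            if answer:
                return answer
    raise TimeoutError("no replica of this partition is reachable")

if __name__ == "__main__":
    cluster = [Node("node-1", [0, 1]), Node("node-2", [1, 2]), Node("node-3", [2, 3])]
    print(query(cluster, ["smith", "john", "fl"]))
```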
Tobias Macey
0:44:53
going back to the architecture of the system, the fact that it has been in development and use for so long, and the massive changes that have occurred at the industry level in how we approach data management, the uses of data, and the overall systems we might need to integrate with, I'm curious how the HPCC platform itself has evolved accordingly, and what integration points are available for being able to reach out to or consume from some of these other systems that organizations might be using.
Flavio Villanustre
0:45:26
we have changed quite a bit. Even though the HPCC Systems name and some of the code base resemble what we had 20 years ago, as you can imagine any piece of software is a living entity that changes and evolves as long as the community behind it is active. So we have changed significantly: we have not just added core functionality to HPCC, or changed functionality to adapt to the times, but also built integration points. I mentioned Spark, for example. Even though HPCC is very similar to Spark, Spark has a large community around machine learning, so it is useful to integrate with Spark, because many times people may be using Spark ML but want to use HPCC for data management. Having a proper integration where you can run Spark ML on top of HPCC is something that can be attractive to a significant portion of the HPCC open source community. In other cases — Hadoop and HDFS access, for example — it's the same story, and the same goes for integrations with other programming languages. Many times people don't feel comfortable programming everything in ECL. ECL works very well for data-management-centric processes, but sometimes you have little components in the process that cannot be easily expressed in ECL, or at least not in a way that is efficient.
0:46:55
I'll just throw out one little example: generating unique IDs for things, in a random manner, like UUIDs.
0:47:06
Surely you could code this in ECL — you could come up with some crafty way of doing it in ECL — but it would make absolutely no sense to do it in ECL, to then be compiled into some big chunk of C++, when I could do it directly in C or C++ or Python or Java or JavaScript. So being able to embed all of these languages into ECL became quite important, and we built quite a bit of integration for embedded languages. A few major versions ago, a few years back, we added support for — I mentioned some of these languages already — Python, Java and JavaScript, and of course C and C++ were already available before. So people can take these little snippets of functionality and create attributes that are just embedded-language attributes, and those are exposed in ECL as if they were ECL primitives. That expands the ability of the core language to support new things without the need to write them natively in ECL every time. There are plenty of other enhancements on the front-end side as well. I mentioned ESP — ESP is this front-end access layer; think of it as some sort of message box in front of your Roxie system. In the past, we used to require that you code your ECL query for Roxie, and then you needed an ESP service coded in C++ — you needed to go to ESP and extend it with a dynamic module to support the front-end interface for that query, which is twice the work, and it requires someone that also knows C++, not just someone that knows ECL. So we changed that, and we now use something called dynamic ESDL that auto-generates these interfaces from ESP. As you code the ECL, all that is expected is that you expose the query with a parameterized interface, and then ESP will automatically take those parameters and expose them in a front-end interface for users to consume. Over the last decade we have also done quite a bit of integration with systems that can help with benchmarking of HPCC, availability monitoring, performance monitoring and capacity planning of HPCC as well. We try to integrate as much as we can with other components in the open source community. We truly love open source projects, so if there is a project that has already done something we can leverage, we try to stay away from reinventing the wheel and we use it. If it's not open source, if it's commercial, we do have a number of integrations with commercial systems as well — we are not religious about it — but certainly it's a little bit less enticing to put the effort into something that is closed source. And again, we believe that the open source model is a very good model, because it gives you the ability to know how things are done under the hood, and to extend them and fix them if you need to. We do this all the time with other projects, and we believe it has a significant amount of value for anyone out there.
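As a trivial illustration of the UUID example Flavio mentions, the kind of snippet you would embed rather than re-express in ECL might look like the following plain Python. The function name is made up; HPCC's embedded-language support would expose something like this to ECL as if it were a native attribute.

```python
import uuid

def make_ids(count):
    """Generate `count` random (version 4) UUIDs.

    Logic like this is awkward to express efficiently in a declarative
    language, but trivial in Python; embedded-language attributes let a
    snippet of this kind be called from ECL as an ordinary primitive.
    """
    return [str(uuid.uuid4()) for _ in range(count)]

if __name__ == "__main__":
    print(make_ids(3))
```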
Tobias Macey
0:50:26
On the subject of the open source nature of the project, I know that it was released as open source in, I think you said, the 2011 timeframe, which postdates when Hadoop had become popular and started to accrue its own ecosystem. I'm curious what your thoughts are on the relative strength of the communities for Hadoop and HPCC currently, given that there seems to be a bit of a decline in Hadoop itself in terms of the amount of utility that organizations are getting from it. I'm also interested in the governance strategy that you have for the HPCC platform, and some of the ways that you approach sustainability of the project.
Flavio Villanustre
0:51:08
So you're absolutely right — the Hadoop community has apparently, at least, reached a plateau, while the HPCC Systems community keeps growing in number of people. Of course, Hadoop was the first out in the open. We had HPCC for a very long time as closed source; it was proprietary, and at the time we believed it was so core to our competitive advantage that we couldn't afford to release it any other way. Then we realized that, in reality, the core advantage we have is on one side our data assets and on the other side the high-level algorithms. We knew that the platform would be better sustained in the long run — and sustainability is an important factor for us, because the platform is so core to everything we do — by making it open source and free, completely free, not just free as in speech but also free as in beer. We thought that would be the way to ensure long-term sustainability, development, expansion and innovation in the platform itself. But when we did that it was 2011, a few years after Hadoop. Hadoop, if you remember, started as part of another project around web crawling, and eventually ended up as its own top-level Apache project in 2008, I believe. So it had already been out there for three or four years, and its community was already really large. Over time we did gather a fairly active community, and today we have a very active, deeply technical community that not only helps with extending and expanding HPCC, but also provides us use cases — sometimes interesting use cases of HPCC — and uses HPCC regularly. So the HPCC Systems community continues to grow, while the Hadoop community seems to have reached a plateau. Now, there are other communities out there which also handle some of the data management aspects with their own platforms, like Spark, which I mentioned before and which seems to have a better performance profile than what Hadoop has, so it has also been gathering active people in its community. But I don't think open source is a zero-sum game, where if one community grows another one must shrink and the total number of people stays the same across all of them. I think every new platform that introduces capabilities to open source communities, brings new ideas, and helps apply innovation to those ideas is helping the overall community in general. So it's great to see communities like the Spark community growing, and I think there's an opportunity — and many users in both communities use both at some point — for each to leverage what is done in the others. Surely, sometimes the specific language used in the platforms creates a little bit of a barrier. Some of these communities, just because Java is potentially more common, use Java instead of C++ and C, so you see that sometimes people in one community, who may be more versed in Java, feel uncomfortable going and trying to understand the code in the other platform that is written in a different language.
0:54:52
But even then, at least
0:54:55
generally the differences in functionality and capabilities can be extracted and reused, and I think this is good for the overall benefit of everyone. I see open source, in many cases, as an experimentation playground where people can bring new ideas, apply those ideas to some code, and then everyone else eventually leverages them, because these ideas percolate across different projects. It's quite interesting. Having been involved personally in open source since the early 90s, I'm quite fond of the process and of how open source works. I think it's beneficial to everyone, in every community.
Tobias Macey
0:55:37
And in terms of the ways that you're taking advantage of the HPCC platform at LexisNexis, and some of the ways that you have seen it used elsewhere, I'm wondering what are some of the notable capabilities that you're leveraging, and some of the interesting ways that you've seen other people take advantage of it?
Flavio Villanustre
0:55:54
That's a good question, and
0:55:56
the answer might take a little bit longer. At LexisNexis in particular, we certainly use HPCC for almost everything we do, because almost everything we do is data management in some way, or data quality. And we have interesting approaches to things. We have a number of processes that run on data, and one of those is this probabilistic linkage process. Probabilistic linkage sometimes requires quite a bit of code to make it work correctly, so there was a point where we were writing it in ECL and it was creating a code base that was getting more and more sizable — larger, bigger, less manageable. So at
0:56:39
some point, we decided
0:56:41
that the level of abstraction, which is pretty high anyway in ECL, wasn't enough for probabilistic data linkage. So we created another language — we called it SALT — and that language is open source as well, by the way. It's what you could consider a domain-specific language for probabilistic record linkage and data integration. There is a compiler for SALT that compiles SALT into ECL, the ECL compiler compiles ECL into C++, and Clang or GCC compiles that into assembler. So you can see how the abstraction layers are like the layers of an onion. Of course, every time you apply an improvement or optimization in the SALT compiler — or sometimes the GCC compiler team applies an optimization — everyone on top of that layer benefits from the optimization, which is quite interesting. We liked it so much that eventually we tackled another problem, which is dealing with graphs. And when I say graphs, I mean social graphs rather than
0:57:46
charts.
0:57:47
So we built yet another language that deals with graphs and machine learning — particularly machine learning on graphs — which is called KEL, or Knowledge Engineering Language. We don't have an open source version of that one, by the way, but we do have a version of the compiler out there for people who want to try it. KEL also generates ECL, and ECL generates C++ — so again, back to the same point: this is an interesting approach to building abstractions by creating DSLs, domain-specific languages, on top of ECL. Another interesting application of HPCC, outside of LexisNexis, is a company called Guardhat. They make hard hats that are smart: they can do geofencing for workers, and they can detect risky environments in manufacturing or construction settings. They use HPCC, and they use some of the real-time integrations that we have — things like Kafka and CouchDB; I mentioned the activity on integrating HPCC with other open source projects — to essentially manage all of this data, which is fairly real-time,
0:58:57
and create
0:58:58
this real-time analytics and real-time machine learning execution for the models that they have, along with integration of data and even visualization on top of it. And there are more — I could go on for days giving you ideas of things that we, or others in the community, have done using HPCC.
Tobias Macey
0:59:21
And in terms of the overall experience that you have had working with HPCC, both on the platform side and as a user of it, what have you found to be some of the most challenging aspects, and some of the most useful and interesting lessons that you've learned in the process?
Flavio Villanustre
0:59:38
That is a great question. And
0:59:40
I'll give you a very simple answer, and then I'll explain what I mean. One of the biggest challenges, if you are a new user, is ECL. One of the biggest benefits is also ECL. Unfortunately, not everyone is well versed in declarative programming models. So when you are exposed for the first time to a declarative language that
1:00:04
has immutability and laziness. And
1:00:09
no side effects, it can be a little bit of a brain twister: you need to think about problems in a slightly different way to be able to solve them. When you are used to imperative programming, you typically solve a problem by decomposing it into a recipe of things that the computer needs to do, step by step, one by one. When you do declarative programming, you decompose the problem into a set of functions that need to be applied, and you build it up from there. It's a slightly different approach, but once you get the idea of how it works, it becomes quite powerful for a number of reasons. First of all, you get to understand the problem better, and you can express the algorithms in a far more succinct way — the program ends up being just a collection of attributes, where some attributes depend on other attributes that you have defined. It also helps you better encapsulate the components of the problem, so your code, instead of becoming the sort of spaghetti that is hard to troubleshoot, is well encapsulated, both in terms of functionality and in terms of data. If you need to touch anything later on, you can do it safely, without needing to worry about what some function you call over here could be doing over there, because you know there are no side effects. And it also gives you the ability — as long as you name your attributes well, so people understand what they are supposed to do — to collaborate more easily with other people. After a while I realized that the code I was writing in ECL — and others have said the same — was, first of all, mostly correct most of the time, which is not what happens with non-declarative programming. You know that if the code compiles, there is a high chance that it will run correctly and give you correct results. As I was explaining before, when you have a dynamically typed, imperative language with side effects, the code may well compile, and maybe it will run fine a few times, but one day it may give you some sort of runtime error because a type is mismatched, or some side effect you didn't consider when you re-architected a piece of the code kicks in and makes your results different from what you expected. So I think ECL has been quite a blessing from that standpoint. But of course it does require that you learn this new methodology of programming, which is similar to what someone who knows Python or Java needs to learn in order to apply SQL, another declarative language: you don't write SQL imperatively when you are querying a database.
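For readers who have not used a declarative language before, here is a small contrast, written in Python only to keep all examples here in one language, between the step-by-step imperative style and the attribute-composition style Flavio is describing. It is an analogy for how ECL code tends to be organized, not ECL itself.

```python
# Imperative style: a step-by-step recipe with mutable state.
def average_valid_scores_imperative(rows):
    total, count = 0, 0
    for row in rows:
        if row["score"] is not None:
            total += row["score"]
            count += 1
    return total / count if count else None

# Declarative-ish style: name the intermediate "attributes" as pure,
# side-effect-free expressions and compose them. Roughly how an ECL
# solution is organized: a collection of attributes, some defined in
# terms of others, with no mutation and no side effects.
def average_valid_scores_declarative(rows):
    valid_scores = [r["score"] for r in rows if r["score"] is not None]      # attribute 1
    return sum(valid_scores) / len(valid_scores) if valid_scores else None   # attribute 2

rows = [{"score": 10}, {"score": None}, {"score": 20}]
assert average_valid_scores_imperative(rows) == average_valid_scores_declarative(rows) == 15.0
```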
Tobias Macey
1:03:34
Looking forward, in terms of the medium to long term as well as the near term for the HPCC platform itself, what do you have planned for the future, both in terms of technical capabilities and features, and also in terms of community growth and outreach that you'd like to see?
Flavio Villanustre
1:03:53
So from the technical capabilities and features side, we tend to have a community roadmap, and we try as much as we can to stick with that roadmap. We have the big ideas that tend to go into the next or the following major version, the smaller ideas that are typically non-disruptive and don't break backwards compatibility, which go into the minor versions, and then of course the bug fixes.
1:04:23
Like many say they are not bugs, but opportunities.
1:04:26
But on the
1:04:28
big-ideas side of things, some of the things we've been doing — I mentioned before that better integration with other open source projects is quite important. We've also been trying to change some of the underlying components of the platform; there are some components that we have had for a very, very long
1:04:46
time, like, for example, the
1:04:48
underlying communication layer in Roxie, and we think it may be ripe right now for further revamping by incorporating some of the standard communication layers out there. There is also the idea of making the platform far more cloud friendly. It already runs very well in many public clouds — in OpenStack, Amazon, Google and Azure — but we want to also make the clusters more dynamic. I don't know if you spotted it when I explained how you do data management with Thor: I said you have a Thor of a certain size — well, what happens when you want to change that ten-node Thor and make it 20 or 30 nodes, or five nodes? Maybe you have a small process that would work fine with just a couple of nodes, or one node, and a large process that may need 1,000 nodes. Today, you can't dynamically resize the Thor cluster. Surely you can resize it by hand and then redistribute the data, and now you have the data on the number of nodes you have, but it is a lot more involved than we would like. In dynamic cloud environments this elasticity becomes quite important, because that's one of the benefits of cloud, so making the clusters more elastic, more dynamic, is another big goal. Certainly, we continue to develop machine learning capabilities on top of the platform: we have a library of machine learning functions, algorithms and methods, and we are expanding it. Some of these machine learning methods are quite innovative. One of our core developers, who is also a researcher, developed a new distributed algorithm for K-means clustering which she hadn't seen in the literature before. It's part of a paper and her PhD dissertation, which is very good, and it is also now part of HPCC, so people can leverage it. It gives significantly higher scalability to K-means, particularly if you're running on a very large number of nodes. I won't get into the details of how it achieves this far better performance, but in short, it distributes the data less and instead distributes the centers more, and it uses the associative property of the main loop of K-means clustering to try to minimize the number of data records that need to be moved around. That's it from the standpoint of the roadmap and the platform itself. On the community side, we continue to try to expand the community as much as we can. One of our core interests — I mentioned this core developer who is also a researcher — is to get more researchers and more of academia onto the platform. We have a number of collaboration initiatives with universities in the US and abroad: Oxford University in the UK, University College London, Humboldt University in Germany, and a number of universities in the US — Clemson University, Georgia Tech and Georgia State University, among others — and we want to expand this program further. We also have an internship program. One of the goals we want to achieve with the HPCC Systems open source project is to help better balance the community behind it in terms of diversity — gender diversity, regional diversity and background diversity.
So we are also trying to put quite a bit of emphasis on students, even high school students. We are doing quite a bit of activity with high schools: on one side trying to get them more into technology — and of course to learn HPCC — but also trying to get more women into technology, and more people who otherwise wouldn't get into technology because they don't get exposed to it at home. That's another core piece of activity in the HPCC community. Last but not least, as part of this diversity effort, there are certain communities that are a little bit more disadvantaged than others. One of those is people on the autism spectrum, so we have been doing quite a bit of work with organizations that help these individuals, trying to enable them through a number of activities. Some of those have to do with training them on HPCC Systems as a platform and on data management, to open up opportunities for them and their lives. Many of these individuals are extremely intelligent — they're brilliant; they may have other limitations because of their condition, but they can be very, very valuable resources, not just for LexisNexis Risk Solutions, where ideally we could hire them, but for other organizations as well.
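The distributed K-means idea Flavio describes a little earlier — move the centers around instead of the data, and exploit the associativity of the per-iteration sums — can be sketched roughly as follows. This is an illustrative Python toy over 2-D points, not the published algorithm or the HPCC implementation.

```python
import random

def assign_and_sum(partition, centers):
    """For one data partition (which stays on its node), compute partial
    sums and counts per center. These partials are associative, so they can
    be combined in any order without moving the underlying records."""
    k = len(centers)
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for x, y in partition:
        nearest = min(range(k), key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
        sums[nearest][0] += x
        sums[nearest][1] += y
        counts[nearest] += 1
    return sums, counts

def combine(a, b):
    """Associative combine of two (sums, counts) partials."""
    sums = [[sa[0] + sb[0], sa[1] + sb[1]] for sa, sb in zip(a[0], b[0])]
    counts = [ca + cb for ca, cb in zip(a[1], b[1])]
    return sums, counts

def kmeans(partitions, centers, iterations=10):
    for _ in range(iterations):
        partials = [assign_and_sum(p, centers) for p in partitions]  # runs locally per node
        sums, counts = partials[0]
        for p in partials[1:]:
            sums, counts = combine((sums, counts), p)                # only tiny partials move
        centers = [
            (s[0] / c, s[1] / c) if c else centers[i]
            for i, (s, c) in enumerate(zip(sums, counts))
        ]
    return centers

if __name__ == "__main__":
    random.seed(0)
    blob = lambda cx, cy: [(cx + random.random(), cy + random.random()) for _ in range(50)]
    partitions = [blob(0, 0) + blob(5, 5), blob(0, 0) + blob(5, 5)]
    print(kmeans(partitions, [(0.0, 0.0), (1.0, 1.0)]))
```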
Tobias Macey
1:09:48
It's great to hear that you have all these outreach opportunities for trying to help bring more people into technology, as a means of giving back as well as a means of helping to grow your community and contribute to the overall use cases that it empowers. So, for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Flavio Villanustre
1:10:19
I think there are a number of gaps, but the major one is that many of the platforms out there tend to be quite clunky when it comes to integrating things. Unfortunately, I don't think we are mature enough yet. By mature enough, I mean: if you are a data management person, you know data very well, you know data analytics, you know data processes, but you don't necessarily know operating systems, and you are not a computer scientist who can deal with data partitioning and the computational complexity of algorithms on partitioned data. There are many details that shouldn't be necessary for you to do your job correctly, but unfortunately today, because of the state of things, many of these systems — commercial and non-commercial — force you to take care of all of those details, or to assemble a large team of people, from system administrators to network administrators to operating system specialists to middleware specialists, before you can build a system that lets you do your data management. That's something we try to overcome with HPCC by giving you this homogeneous system that you deploy with a single command and that you can use a minute later, after you've deployed it. I won't say that we are in the ideal situation yet — I think there is still much to improve on — but I think we are a little bit further along than many of the other options out there. If you know the Hadoop ecosystem, you know how many different components are out there, and if you have done this for a while, you know that one day you realize there is a security vulnerability in one component, maybe, and now you need to update it. But in order to do that, you're going to break the compatibility of the new version with something else, so now you need to update that other thing, but there is no update for it, because it depends on yet another component — and this goes on and on and on. So having something that is homogeneous, that doesn't require you to be a computer scientist to deploy and use, and that truly gives you the abstraction layer you need — which is data management — is missing; that is a significant limitation of many, many systems out there. And I'm not just pointing this at open source projects, but at commercial products as well. I think it's something that some of the people designing and developing these systems might not understand, because they are not the users. You need to put yourself in the shoes of the user in order to do the right thing; otherwise, whatever you build is pretty difficult to apply, and sometimes useless.
Tobias Macey
1:13:03
Well, thank you very much for taking the time today to join me and describe the ways that HPCC is built and architected, as well as some of the ways that it's being used both inside and outside of LexisNexis. I appreciate all of your time and all the information there. It's definitely a very interesting system, and one that looks to provide a lot of value and capability, so I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
Flavio Villanustre
1:13:30
Thank you very much. I really enjoyed this and I look forward to doing this again. So one day we'll get together again. Thank you

Digging Into Data Replication At Fivetran - Episode 93

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing George Fraser about FiveTran, a hosted platform for replicating your data from source to destination

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that Fivetran solves and the story of how it got started?
  • Integration of multiple data sources (e.g. entity resolution)
  • How is Fivetran architected and how has the overall system design changed since you first began working on it?
  • monitoring and alerting
  • Automated schema normalization. How does it work for customized data sources?
  • Managing schema drift while avoiding data loss
  • Change data capture
  • What have you found to be the most complex or challenging data sources to work with reliably?
  • Workflow for users getting started with Fivetran
  • When is Fivetran the wrong choice for collecting and analyzing your data?
  • What have you found to be the most challenging aspects of working in the space of data integrations?
  • What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
  • What do you have planned for the future of Fivetran?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200Gbit private networking, scalable shared block storage and a 40Gbit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode — that's L-I-N-O-D-E — today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing George Fraser about Fivetran, a platform for shipping your data to data warehouses in a managed fashion. So George, can you start by introducing yourself?
George Fraser
0:01:54
Yeah, my name is George. I am the CEO of Fivetran, and I was one of two co-founders of Fivetran almost seven years ago when we started.
Tobias Macey
0:02:02
And do you remember how you first got involved in the area of data management?
George Fraser
0:02:05
Well, before Fivetran, I was actually a scientist, which is a bit of an unusual background for someone in data management, although it was sort of an advantage for us that we were coming at it fresh. So much has changed in the area of data management, particularly because of the new data warehouses that are so much faster, so much cheaper, and so much easier to manage than the previous generation, that a fresh approach is really merited. And so, in a weird way, the fact that none of the founding team had a background in data management was kind of an advantage.
Tobias Macey
0:02:38
And so can you start by describing a bit about the problem that Fivetran was built to solve, the overall story of how it got started, and what motivated you to build a company around it?
George Fraser
0:02:50
Well, I'll start with the story of how it got started. In late 2012, when we started the company — Taylor and I, and then Mel, who's now our VP of engineering and who joined early in 2013 — Fivetran was originally a vertically integrated data analysis tool. It had a user interface that was sort of a super-powered spreadsheet slash BI tool, it had a data warehouse on the inside, and it had a data pipeline that was feeding the data warehouse. Through many iterations of that idea, we discovered that the really valuable thing we had invented was actually the data pipeline that was part of it. So we threw everything else away, and the data pipeline became the product. And the problem that Fivetran solves is the problem of getting all your company's data in one place. Companies today use all kinds of tools to manage their business: you use CRM systems like Salesforce, payment systems like Stripe, support systems like Zendesk, finance systems like QuickBooks or Zuora, and you have a production database somewhere — maybe you have 20 production databases. If you want to know what is happening in your business, the first step is usually to synchronize all of this data into a single database, where an analyst can query it and where you can build dashboards and BI tools on top of it. So that's the primary problem that Fivetran solves. People use Fivetran to do other things too — sometimes they use the data warehouse that we're syncing to as a production system — but the most common use case is that they're just trying to understand what's going on in their business, and the first step in that is to sync all of that data into a single database.
Tobias Macey
0:04:38
And in recent years, one of the prevalent approaches for being able to get all the data into one location for being able to do analysis across it is to dump it all into a data lake because of the fact that you don't need to do as much upfront schema management or data cleaning. And then you can experiment with everything that's available. And I'm wondering what your experience has been as far as the contrast between loading everything into a data warehouse for that purpose versus just using a data lake.
George Fraser
0:05:07
Yeah. So in this area, I think that sometimes people present a bit of a false choice between you can either set up a data warehouse do full on Kimball dimensional schema, data modeling, and Informatica with all of the upsides and downsides that come with that, or you can build a data lake, which is like a bunch of JSON and CSV files in s3. And I say false choice, because I think the right approach is a happy medium, where you don't go all the way to sticking raw JSON files and CSV files in s3, that's really unnecessary. Instead, you use a proper relational data store. But you exercise restraint, and how much normalization and customization you do on the way in. So you say, I'm going to make my first goal to create an accurate replica of all the systems in one database, and then I'm going to leave that alone, that's going to be my sort of staging area, kind of like my data lake, except it lives in a regular relational data warehouse. And then I'm going to build whatever transformations I want to do have that data on top of that data lake schema. So another way of thinking about it is that I am advising that you should take a data lake type approach, but you shouldn't make your data lake a separate physical system. Instead, your data lake should just be a different logical system within the same database that you're using to analyze all your data. And to support your BI tool. It's just a higher productivity simpler workflow to do it that way.
Tobias Macey
0:06:47
Yeah. And that's where the current trend of moving the transformation step until after the data loading — the ELT pattern — has been coming from, because of the flexibility of these cloud data warehouses that you've mentioned, as far as being able to consume semi-structured and unstructured data while still being able to query across it and introspect it for the purposes of joining with other information that's already within that system.
George Fraser
0:07:11
Yeah, the ELT pattern is really just a great way to get work done. It's simple, and it allows you to recover from mistakes. If you make a mistake in your transformations — and you will make mistakes in your transformations — or even if you just change your mind about how you want to transform the data, the great advantage of the ELT pattern is that the original untransformed data is still sitting there, side by side, in the same database. So it's just really easy to iterate in a way that it isn't if you're transforming the data on the fly, or even if you have a data lake where you store the API responses from all of your systems — that's still more complicated than if you just have this nice replica sitting in its own schema in your data warehouse.
Tobias Macey
0:07:58
And so one of the things that you pointed out is needing to be able to integrate across the multiple different data sources that you might be using within a business, and you mentioned things like Salesforce for CRM, or ticket tracking and user feedback tools such as Zendesk, etc. I'm wondering what your experience has been as far as being able to map the logical entities across these different systems together, to be able to effectively join and query across those data sets, given that they don't necessarily have a shared source of truth for things like how customers are represented, or even what the common field names might be, to be able to map across those different entities.
George Fraser
0:08:42
Yeah, this is a really important step, and it's the first thing we always advise our customers to do — and really anyone who's building a data warehouse: you need to keep straight in your mind that there are really two problems here. The first problem is replicating all of the data, and the second problem is rationalizing all the data into a single schema. You need to think of these as two steps — you need to follow proper separation of concerns, just as you would in a software engineering project. So we really focus on that first step, on replication. What we have found is that the approach that works really well for our customers for rationalizing all the data into a single schema is to use SQL. SQL is a great tool for unioning things, joining things, changing field names, filtering data — all the kinds of work you need to do to rationalize a bunch of different data sources into a single schema. We find the most productive way to do that is a bunch of SQL queries that run inside your data warehouse.
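A toy example of the rationalization step George describes, with invented schema and column names: the raw replicas stay untouched in their own schemas, and a single SQL statement, run inside the warehouse, unions and renames them into one analysis-friendly view. It is shown as a Python string only to keep all examples here in one language.

```python
# Hypothetical staging schemas created by replication: salesforce.account and
# stripe.customer. The "rationalized" model is just a SQL view built on top.
UNIFIED_CUSTOMERS_SQL = """
CREATE OR REPLACE VIEW analytics.customers AS
SELECT
    id           AS customer_id,
    name         AS customer_name,
    'salesforce' AS source_system
FROM salesforce.account
UNION ALL
SELECT
    id           AS customer_id,
    description  AS customer_name,
    'stripe'     AS source_system
FROM stripe.customer
"""

def rationalize(cursor):
    """Run the transformation inside the warehouse (ELT): the raw replicas
    stay in their own schemas, and the unified view lives alongside them.
    `cursor` would be a DB-API cursor from whatever warehouse driver you use."""
    cursor.execute(UNIFIED_CUSTOMERS_SQL)
```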
Tobias Macey
0:09:44
And do you have your own tooling and interfaces for being able to expose that process to your end users? Or do you also integrate with tools such as dbt for being able to have that overall process controlled by the end user?
George Fraser
0:10:00
So we originally did not do anything in this area other than give advice, and we got the advantage of being able to watch what our users did in that context. What we saw is that a lot of them set up cron to run SQL scripts on a regular schedule, a lot of them used Looker persistent derived tables, and some people used Airflow — and they used Airflow in kind of a funny way: they didn't really use the Python parts of Airflow, they just used it as a way to trigger SQL. And when dbt came out, we got a decent community of users who use dbt. We're supportive of whatever mechanism you want to use to transform your data. We do now have our own transformation tool built into our UI, and it's a first version that you can use right now: it's basically a way that you can provide a SQL script and trigger that script when Fivetran delivers new data to your tables. We've got lots of people using that first version, and it's going to continue to evolve over the rest of this year — it's going to get a lot more sophistication, and it's going to do a lot more to give you insight into the transforms that are running and how they all relate to each other. But the core idea of it is that SQL is the right tool for transforming data.
Tobias Macey
0:11:19
And before we get too far into the rest of the feature set and capabilities of Fivetran, I'm wondering if you can talk about how the overall system is architected, and how the overall system design has evolved since you first began working on it.
George Fraser
0:11:33
Yeah, so the overall architecture is fairly simple. The hard part of Fivetran is really not the high-class data problems — things like queues and streams and giant data sets flying around. The hard part of Fivetran is all of the incidental complexity of all of these data sources: understanding all the small, crazy rules that every API has. So most of our effort over the years has actually been devoted to hunting down all these little details of every single data source we support, and that's what makes our product really valuable. The architecture itself is fairly simple. The original architecture was essentially a bunch of EC2 instances with cron, running a bunch of Java processes on a fast batch cycle, syncing people's data. Over the last year and a half, the engineering team has built a new architecture based on Kubernetes. There are many advantages to this new architecture for us internally — the biggest one is that it auto-scales — but from the outside, you can't even tell when you migrate from the old architecture to the new one, other than having to whitelist a new set of IPs. So it was a very simple architecture in the beginning, and it's gotten somewhat more complex. But really, the hard part of Fivetran is not the high-class data engineering problems; it's the little details of every data source, so that from the user's perspective you just get this magical replica of all of your systems in a single database.
Tobias Macey
0:13:16
And for being able to keep track of the overall health of your system and ensure that data is flowing from end to end for all of your different customers. I'm curious what you're using for monitoring and alerting strategy and any sort of ongoing continuous testing, as well as advanced unit testing that you're using to make sure that all of your API interactions are consistent with what is necessary for the source systems that you're working with?
George Fraser
0:13:42
Yeah, well, first of all, there are several layers to that. The first one is actually the testing that we do on our end to validate that all of our sync strategies — all those little details I mentioned a minute ago — are actually working correctly. Our testing problem is quite difficult, because we interoperate with so many external systems, and in many cases you really have to run the tests against the real system for the test to be meaningful. So our build architecture is actually one of the more complex parts of Fivetran. We use a build tool called Bazel, and we've done a lot of work, for example, to run all of the databases and FTP servers and things like that that we have to interact with in Docker containers, so that we can produce reproducible end-to-end tests. That actually is one of the more complex engineering problems at Fivetran — and if that sounds interesting to you, I encourage you to apply to our engineering team, because we have lots more work to do on it. So that's the first layer: all of those tests that we run to verify that our sync strategies are correct. The second layer is whether it's working in production — is the customer's data actually getting synced, and is it getting synced correctly? One of the things we do there that may be a little unexpected to people who are accustomed to building data pipelines themselves is that all Fivetran data pipelines are typically fail-fast. That means if anything unexpected happens — if we see, you know, some event from an API endpoint that we don't recognize — we stop. Now, that's different than when you build data pipelines yourself: when you build data pipelines for your own company, usually you will have them try to keep going no matter what. But Fivetran is a fully managed service, and we're monitoring it all the time, so we tend to make the opposite choice: if anything suspicious is going on, the correct thing to do is just stop and alert Fivetran — hey, go check out this customer's data pipeline, what the heck is going on, something unexpected is happening — and we should make sure that our sync strategies are actually correct. And then that brings us to the last layer of this, which is alerting. When data pipelines fail, we get alerted and the customer gets alerted at the same time, and then we communicate with the customer and say, hey, we may need to go in and check something — do I have permission to go look at what's going on in your data pipeline in order to figure out what's going wrong? Because Fivetran is a fully managed service, and that is critical to making it work. When you do what we do, and you say you are going to take responsibility for actually creating an accurate replica of all of your systems in your data warehouse, that means you're signing on to comprehend and fix every little detail of every data source that you support. And a lot of those little details only come up in production, when some customer shows up and they're using a feature of Salesforce that Salesforce hasn't sold for five years — but they've still got it, and you've never seen it before. A lot of those little things only come up in production. The nice thing is that while that set of little things is very large, it is finite, and we only have to discover each problem once; every customer thereafter benefits from that.
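A minimal sketch of the fail-fast behavior George describes. The connector, field list, and alerting hook are all invented for illustration; the point is simply that anything unrecognized stops the sync and raises an alert rather than being silently dropped.

```python
EXPECTED_FIELDS = {"id", "email", "created_at"}  # fields this toy connector knows how to sync

class UnexpectedSourceData(Exception):
    pass

def sync_page(records, load_batch, alert):
    """Fail fast: if the source returns something we don't recognize, stop
    and alert rather than guessing or silently mangling data."""
    for record in records:
        unknown = set(record) - EXPECTED_FIELDS
        if unknown:
            alert(f"unknown fields {unknown}; pausing connector for investigation")
            raise UnexpectedSourceData(unknown)
    load_batch(records)

if __name__ == "__main__":
    sync_page(
        [{"id": 1, "email": "a@example.com", "created_at": "2019-01-01"}],
        load_batch=lambda batch: print(f"loaded {len(batch)} records"),
        alert=print,
    )
```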
Tobias Macey
0:17:00
For the system itself, one of the things that I noticed while I was reading through the documentation and the feature set is that for all of these different source systems, you provide automated schema normalization. I'm curious how that works, and the overall logic flow that you have built in — is it just a static mapping that you have for each different data source, or is there some sort of more complex algorithm going on behind the scenes — as well as how that works for any customized data sources, such as application databases that you're working with, or maybe just JSON feeds or event streams.
George Fraser
0:17:38
Sure. So the first thing you have to understand is that there are really two categories of data sources in terms of schema normalization. The first category is databases, like Oracle or MySQL or Postgres, and database-like systems — NetSuite is really basically a database when you look at the API, and so is Salesforce. There's a bunch of systems that basically look like databases: they have arbitrary tables and columns, and you can set any types you want in any column. What we do with those systems is create an exact one-to-one replica of the source schema. It's really as simple as that. There's a lot of work involved, because the change feeds from those systems are usually very complicated, and it's very complex to turn those change feeds back into the original schema, but it is automated. So for databases and database-like systems, we just produce the exact same schema in your data warehouse as in the source. For apps — things like Stripe or Zendesk or GitHub or Jira — we do a lot of normalization of the data. With tools like that, when you look at the API responses, they are very complex and nested, and usually very far from the original normalized schema that this data probably lived in, in the source database. Every time we add a new data source of that type, we study it — I joke that we reverse engineer the API — and we basically figure out what the schema of the database this originally came from must have been, and we unwind all the API responses back into that normalized schema. These days, we often just get an engineer at the company that is that data source on the phone and ask them, you know, what is the real schema here — we've found that we can save ourselves a whole lot of work by doing that. But the goal is always to produce a normalized schema in the data warehouse. And the reason we do that is because we think that if we put in the work up front to normalize the data in your data warehouse, we can save every single one of our customers a whole bunch of time traipsing through the data trying to figure out how to normalize it. So we figure it's worthwhile for us to put the effort in up front so our customers don't have to.
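A rough illustration of unwinding a nested API response back into normalized tables, as George describes. The payload shape and table names are invented and deliberately simplified; real connectors deal with far messier responses.

```python
def normalize_invoice(payload):
    """Unwind one nested API response (shape invented for illustration) back
    into two flat, relational rows sets: invoices and invoice_line_items."""
    invoice_row = {
        "id": payload["id"],
        "customer_id": payload["customer"]["id"],
        "currency": payload["currency"],
    }
    line_item_rows = [
        {
            "invoice_id": payload["id"],
            "line_number": i,
            "amount": item["amount"],
            "description": item.get("description"),
        }
        for i, item in enumerate(payload["lines"]["data"])
    ]
    return invoice_row, line_item_rows

if __name__ == "__main__":
    sample = {
        "id": "in_123",
        "currency": "usd",
        "customer": {"id": "cus_9", "email": "a@example.com"},
        "lines": {"data": [{"amount": 500, "description": "Basic plan"}]},
    }
    print(normalize_invoice(sample))
```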
Tobias Macey
0:20:00
One of the other issues that comes up with normalization, particularly for the source database systems that you're talking about, is the idea of schema drift, when new fields are added or removed, or data types change. I'm wondering how you manage schema drift overall in the data warehouse systems that you're loading into, while preventing data loss, particularly in the cases where a column might be dropped or the data type changed.
George Fraser
0:20:29
Yeah, so there's a core pipeline that all Fivetran connectors — databases, apps, everything — are written against, and all of the rules of how to deal with schema drift are encoded there. Some cases are easy. If you drop a column, that data just isn't arriving anymore; we will leave that column in your data warehouse — we're not going to delete it in case there's something important in it; you can drop it in your data warehouse if you want to, but we're not going to. If you add a column, again, that's pretty easy: we add a column in your data warehouse, all of the old rows will have nulls in that column, obviously, but going forward we will populate it. The tricky cases are when you change types. When you alter the type of an existing column, that can be more difficult to deal with. There are two principles we follow. First, we're going to propagate that type change to your data warehouse — we're going to go and change the type of the column in your data warehouse to fit the new data. Second, when you change types, sometimes you sort of contradict yourself, and we follow the rules of subtyping in handling that. If you think back to your undergraduate computer science classes, this is the good old concept of subtypes: for example, an int is a subtype of a real, and a real is a subtype of a string, et cetera. So we look at all the data passing through the system, we infer the most specific type that can contain all of the values that we have seen, and then we alter the column in the data warehouse to be that type, so that we can actually fit the data into the data warehouse.
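The widening rule George outlines — an int is a subtype of a real, a real is a subtype of a string, and the column becomes the most specific type that fits every observed value — might be sketched like this. It's a simplified illustration, not Fivetran's actual type system.

```python
# Widening order: each type can hold every value of the types before it.
TYPE_ORDER = ["int", "real", "string"]

def type_of(value):
    if isinstance(value, bool):  # treat booleans as strings to keep the example simple
        return "string"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "real"
    return "string"

def widen(current, observed):
    """Return the most specific type that can represent both."""
    return max(current, observed, key=TYPE_ORDER.index)

def infer_column_type(values):
    inferred = "int"  # start with the narrowest type
    for v in values:
        inferred = widen(inferred, type_of(v))
    return inferred

assert infer_column_type([1, 2, 3]) == "int"
assert infer_column_type([1, 2.5]) == "real"
assert infer_column_type([1, 2.5, "n/a"]) == "string"
# A real pipeline would then ALTER the warehouse column to the inferred type
# so that new data still fits.
```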
Tobias Macey
0:22:17
Another capability that you provide is Change Data Capture for when you're loading from these relational database systems into the data warehouse. And that's a problem space that I've always been interested in as far as how you're able to capture the change logs within the data system, and then be able to replay them effectively to reconstruct the current state of the database without just doing a straight SQL dump. And I'm wondering how you handle that in your platform?
George Fraser
0:22:46
Yeah, it's very complicated. Most people who build in-house data pipelines, as you say, just do a dump and load of the entire table, because the change logs are so complicated. The problem with dump and load is that it requires huge bandwidth, which isn't always available, and it takes a long time, so you end up running it just once an hour if you're lucky, but for a lot of people once a day. So we do change data capture: we read the change logs of each database. Each database has a different change log format, and most of them are extremely complicated — if you look at the MySQL change log format, or the Oracle change log format, it's like going back in time through the history of MySQL; you can sort of see every architectural change in MySQL reflected in the change log format. The answer to how we do it is that there's no trick — it's just a lot of work understanding all the possible corner cases of these change logs. It helps that we have many customers with each database. Unlike when you're building a system just for yourself, because we're building a product we have lots of MySQL users and lots of Postgres users, so over time we see all the little corner cases, and you eventually figure it out — you eventually find all the things and you get a system that just works. But the short answer is there's really no trick; it's just a huge amount of effort by the databases team at Fivetran, who at this point have been working on it for years with, you know, hundreds of customers. So at this point we've invested so much effort in tracking down all those little things that there's just no hope that you could do better yourself, building a change log reader just for your own company.
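A highly simplified sketch of the change data capture replay idea: apply a stream of insert/update/delete events, keyed by primary key, to keep a replica current without re-dumping the whole table. The event format here is invented; real change logs (MySQL binlog, Postgres WAL, Oracle redo) are far more involved, which is exactly the complexity George is describing.

```python
def apply_change_log(table, events):
    """Apply a stream of change-log events to an in-memory replica keyed by
    primary key. The replica stays current incrementally, instead of being
    rebuilt by a full dump and load."""
    for event in events:
        op, key, row = event["op"], event["pk"], event.get("row")
        if op in ("insert", "update"):
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table

if __name__ == "__main__":
    replica = {}
    events = [
        {"op": "insert", "pk": 1, "row": {"id": 1, "email": "a@example.com"}},
        {"op": "update", "pk": 1, "row": {"id": 1, "email": "b@example.com"}},
        {"op": "insert", "pk": 2, "row": {"id": 2, "email": "c@example.com"}},
        {"op": "delete", "pk": 2},
    ]
    print(apply_change_log(replica, events))
```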
Tobias Macey
0:24:28
For the particular problem space that you're in, you have a sort of many-to-many issue, where you're dealing with a lot of different types of data sources, and then you're loading into a number of different options for data warehouses. On the source side, I'm wondering what you have found to be some of the most complex or challenging sources to work with reliably, and some of the strategies that you have found to be effective for picking up a new source and getting it production ready in the shortest amount of time.
George Fraser
0:24:57
Yeah, it's funny, you know, if you ask any engineer at Fivetran, they can all tell you what the most difficult data sources are, because we've had to do so much work on them over the years. Undoubtedly, the most difficult data source is Marketo; close seconds are Jira, Asana, and then probably NetSuite. Those APIs just have a ton of incidental complexity; it's really hard to get data out of them fast. We're working with some of these sources to try to help them improve their APIs to make it easier to do replication, but there's a handful of data sources that have required disproportionate work to get working reliably. In general, one funny observation we've made over the years is that the companies with the best APIs tend, unfortunately, to be the least successful companies. It seems to be a general principle that companies which have really beautiful, well organized APIs tend not to be very successful businesses, I guess because they're just not focused enough on sales or something. We've seen it time and again, where we integrate a new data source and we look at the API and we go, man, this API is great, I wish you had more customers so that we could sync for them. The one exception, I would say, is Stripe, which has a great API and is a highly successful company, and that's probably because their API is their product. So there's definitely a spectrum of difficulty. In general, the oldest, largest companies have the most complex APIs.
Tobias Macey
0:26:32
I wonder if there's some reverse incentive where they make their APIs obtuse and difficult to work with so that they can build up an ecosystem of contractors around them, whose sole purpose is to integrate them with other systems.
George Fraser
0:26:46
You know, I think there's a little bit of that, but less than you would think. For example, the company that has by far the most extensive ecosystem of contractors helping people integrate their tool with other systems is Salesforce, and Salesforce's API is quite good. Salesforce is actually one of the simpler APIs out there. It was harder a few years ago when we first implemented it, but they made a lot of improvements, and it's actually one of the better APIs now.
Tobias Macey
0:27:15
Yeah, I think that's probably coming off the tail of their acquisition of MuleSoft to sort of reformat their internal systems and data representation to make it easier to integrate. Because I know beforehand, it was just a whole mess of XML.
George Fraser
0:27:27
You know, it was really before the MuleSoft acquisition that a lot of the improvements in the Salesforce API happened. The Salesforce REST API was pretty well structured and rational five years ago, but it would fail a lot; you would send queries and they would just not return when you had really big data sets, and now it's more performant. So I think it predates the MuleSoft acquisition; they just did the hard work to make all the corner cases work reliably and scale to large data sets, and Salesforce is now one of the easier data sources. Actually, I think there are certain objects that have complicated rules, and I think the developers at Fivetran who work on Salesforce will get mad at me when they hear me say this, but compared to, like, NetSuite, it's pretty great.
Tobias Macey
0:28:12
On the other side of the equation, where you're loading data into the different target data warehouses, I'm wondering what your strategy is as far as being able to make the most effective use of the feature sets that are present, or do you just target the lowest common denominator of SQL representation for loading the data in, and then leave the complicated aspects to the end user for doing the transformations and analyses?
George Fraser
0:28:36
So most of the code for doing the load side is shared between the data warehouses. The differences are not that great between different destinations, except BigQuery; BigQuery is a bit of an unusual creature. If you look at Fivetran's code base, there's actually a different implementation for BigQuery that shares very little with all of the other destinations. So the differences between destinations are not that big of a problem for us. There are certain functions that have to be overridden for different destinations, for things like the names of types, and there are some special cases around performance where our load strategies are slightly different, for example between Snowflake and Redshift, just to get faster performance. But in general, the destinations are actually the easier side of the business. And then in terms of transformations, it's really up to the user to write the SQL that transforms their data. And it is true that to write effective transformations, especially incremental transformations, you always have to use the proprietary features of the particular database that you're working on.
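A hedged sketch of the "shared loader with small per-destination overrides" idea described above: a base class owns the shared logic and a destination overrides just the type-name mapping. The class structure and mappings here are illustrative assumptions, not Fivetran's code.

```python
# A sketch of sharing load code across destinations while overriding small
# pieces such as type names. Class names and mappings are hypothetical.

class BaseDestination:
    def type_name(self, source_type: str) -> str:
        # Default mapping shared by most warehouses.
        return {"string": "VARCHAR", "int": "BIGINT", "real": "DOUBLE PRECISION"}[source_type]

    def create_table_sql(self, table: str, columns: dict) -> str:
        cols = ", ".join(f"{name} {self.type_name(t)}" for name, t in columns.items())
        return f"CREATE TABLE {table} ({cols})"


class BigQueryDestination(BaseDestination):
    def type_name(self, source_type: str) -> str:
        # BigQuery uses different type names than the ANSI-ish defaults.
        return {"string": "STRING", "int": "INT64", "real": "FLOAT64"}[source_type]


print(BaseDestination().create_table_sql("users", {"id": "int", "email": "string"}))
print(BigQueryDestination().create_table_sql("users", {"id": "int", "email": "string"}))
```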
Tobias Macey
0:29:46
On the incremental piece, I'm interested in how you address that for some of the different source systems, because for the databases where you're doing change data capture it's fairly obvious that you can take that approach for data loading. But for some of the more API oriented systems, I'm wondering if there's a high degree of variability in being able to pull in just the objects that have changed since the last sync time, or if there are a number of systems that will just give you absolutely everything every time, and then you have to do that work on your side.
George Fraser
0:30:20
The complexity of those changes, I know I mentioned this earlier, but it is staggering. But yes, on the API side we're also doing change data capture of apps. It is different for every app, but just about every API we work with provides some kind of change feed mechanism. Now, it is complicated; you often end up in a situation where the API will give you a change feed that's incremental, but then other endpoints are not incremental. So you have to do this thing where you read the change feed, you look at the individual events in the change feed, and then you go look up the related information from the other entity. So you end up dealing with a bunch of extra complexity because of that. But as with all things at Fivetran, we have this advantage that we have many customers with each data source, so we can put in that disproportionate effort that you would never put in if you were building it just for yourself to make the change capture mechanism work properly, because we just have to do it once, and then everyone who uses that data source can benefit from it.
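Here is a short sketch of the pattern George describes: walk an incremental change feed and, for each event, look up related details from endpoints that are not incremental. The `api` client and its methods are hypothetical stand-ins for whatever source API is being synced.

```python
# A sketch of syncing via an incremental change feed plus per-event lookups
# against non-incremental endpoints. The `api` and `destination` objects and
# their methods are hypothetical, for illustration only.

def sync_changes(api, destination, since_cursor):
    """Incrementally sync one object type using a change feed plus lookups."""
    for event in api.change_feed(since=since_cursor):       # incremental endpoint
        record = api.get_record(event["object_id"])          # full record lookup
        related = api.get_related(event["object_id"])        # non-incremental endpoint
        destination.upsert(event["object_type"], {**record, "related": related})
        since_cursor = event["cursor"]                        # save progress for next run
    return since_cursor
```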
Tobias Macey
0:31:23
For people who are getting onboarded onto the Fivetran system, I'm curious what the overall workflow looks like as far as the initial setup, and then what their workflow looks like as they're adding new sources or just interacting with their Fivetran account to keep track of the overall health of the system, or if it's largely just fire and forget and they're only interacting with the data warehouse on the other side.
George Fraser
0:31:47
It's pretty simple. The joke at Fivetran is that our demo takes about 15 seconds. Because we're so committed to automation, and we're so committed to this idea that Fivetran's fundamental job is to replicate everything into your data warehouse and then you can do whatever you want with it, there's very little UI. The process of setting up a new data source is basically connect source, which for many sources is as simple as going through an OAuth redirect where you just click, yes, Fivetran is allowed to access my data, and that's it; and connect destination, which, now that we're integrated with Snowflake and BigQuery, can be as simple as pushing a button in Snowflake or in BigQuery to create a Fivetran account that's pre-connected to your data warehouse. So the setup process is really simple. After setup, there's a bunch of UI around monitoring what's happening. We like to say that Fivetran is a glass box; it was originally a black box, and now it's a glass box. You can see exactly what it's doing. You can't change it, but you can see exactly what we're doing at all times. Part of that is in the UI, and part of that is in the emails you get when things go wrong, or when the sync finishes for the first time, that kind of thing.
Tobias Macey
0:33:00
As part of that visibility, I also noticed that you will ship the transaction logs to the end user's log aggregation system. And I thought that was an interesting approach as far as giving them a way to access all of that information in one place, without having to go to your platform just for the one-off case of trying to see what the transaction logs are and gain that extra piece of visibility. So I'm wondering what types of feedback you've gotten from users as far as the overall visibility into your systems and the ways that they're able to integrate it into their monitoring platforms.
George Fraser
0:33:34
Yeah, so the logs we're talking about are the logs of every action Fivetran took, like Fivetran made this API call against Salesforce, Fivetran ran this log miner query against Oracle. We record all this metadata about everything we're doing, and then you can see that in the UI, but you can also ship it to your own logging system, like CloudWatch or Stackdriver, because a lot of companies, in the same way they have a centralized data warehouse, have a centralized logging system. It's mostly used by larger companies; those are the ones who invest the effort in setting up those centralized logging systems. And it's actually the system we built first, before we built it into our own UI. Later we found it's also important just to have it in our own UI, as a quick way to view what's going on. And, yeah, I think people have appreciated that we're happy to support the systems they already have, rather than try to build our own thing and force you to use that.
Tobias Macey
0:34:34
I imagine that that also plays into efforts within these organizations for being able to track data lineage and provenance for understanding the overall lifecycle of their data as it spans across different systems.
George Fraser
0:34:47
You know, that's not so much a logging problem, that's more of a metadata problem inside the data warehouse. When you're trying to track lineage, to say this row in my data warehouse came from this transformation, which came from these three tables, and these tables came from Salesforce, and it was connected by this user, and it synced at this time, and so on, that lineage problem is really more of a metadata problem. And that's kind of a greenfield area right now. There are a couple of different companies trying to solve that problem, and we're doing some interesting work on that in conjunction with our transformations. I think it's a very important problem; there's still a lot of work to be done there.
Tobias Macey
0:35:28
So on the sales side of things too, I know you said that your demo is about 15 seconds as far as, yes, you just do this and this and then your data is in your data warehouse. But I'm wondering what you have found to be some of the common questions or issues that bring people to you as they evaluate your platform for their use cases, and just some of the overall user experience design that you've put into the platform to help ease that onboarding process.
George Fraser
0:35:58
Yeah, so a lot of the discussions in the sales process really revolve around that ELT philosophy: Fivetran is going to take care of replicating all of your data, and then you're going to curate it non-destructively using SQL. For some people that just seems like the obvious way to do it, but for others this is a very shocking proposition, this idea that your data warehouse is going to have this comparatively uncurated schema that Fivetran is delivering data into, and then you're basically going to make a second copy of everything. For a lot of people who've been doing this for a long time, that's a very surprising approach. So a lot of the discussion in sales revolves around the trade-offs of that, and why we think that's the right answer for the data warehouses that exist today, which are just so much faster and so much cheaper that it makes sense to adopt that more human-friendly workflow than it maybe would have in the 90s.
Tobias Macey
0:36:52
And what are the cases where Fivetran is the wrong choice for replicating data or integrating it into a data warehouse?
George Fraser
0:37:00
Well, if you already have a working system, you should keep using it. We don't advise people to change things just for the sake of change. If you've set up, you know, a bunch of Python scripts that are syncing all your data sources and it's working, keep using it. What usually happens that causes people to throw out a system is schema changes, death by a thousand schema changes. They find that the data sources upstream are changing, the scripts that are syncing their data are constantly breaking, and it's this huge effort to keep them alive. That's the situation where prospects will abandon an existing system and adopt Fivetran. But what I'll tell people is, you know, if your schema is not changing, if you're not having to go fix these pipelines every week, don't change it, just keep using it.
Tobias Macey
0:37:49
And as far as the overall challenges or complexities of the problem space that you're working with, I'm wondering what you have found to be some of the most difficult to overcome, or some of the ones that are most noteworthy and that you'd like to call out for anybody else who is either working in this space or considering building their own pipeline from scratch.
George Fraser
0:38:11
Yeah, you know, I think that when we got our first customer in 2015, syncing Salesforce to Redshift, and two weeks later we got our second customer, syncing Salesforce and HubSpot and Stripe into Redshift, I sort of imagined that this sync problem was something we were going to have solved in a year, and then we would go on and build a bunch of other related tools. And the sync problem is much harder than it looks at first. Getting all the little details right so that it just works is an astonishingly difficult problem. It is a parallelizable problem; you can have lots of developers working on different data sources, figuring out all those little details, and we have accumulated general lessons that we've incorporated into our core code, so we've gotten better at doing this over the years. And it really works when you have multiple customers who have each data source, so it works a lot better as a product company than as someone building an in-house data pipeline. But the level of complexity associated with just doing replication correctly was kind of astonishing to me, and I think it is astonishing for a lot of people who try to solve this problem. You look at the API docs of a data source and you figure, oh, I think I know how I'm going to sync this, and then you go into production with ten customers and suddenly you find ten different corner cases that you never thought of that are going to make it harder than you expected to sync the data. So the level of difficulty of just that problem is kind of astonishing, but the value of solving just that problem is also kind of astonishing.
Tobias Macey
0:39:45
On both the technical and business side, I'm also interested in understanding what you have found to be some of the most interesting or unexpected or useful lessons that you've learned in the overall process of building and growing Fivetran.
George Fraser
0:39:59
Well, I've talked about some of the technical lessons, in terms of, you know, just solving that problem really well being both really hard and really valuable. In terms of the business lessons we've learned, growing the company is a co-equal problem to growing the technology. I've been really pleased with how we've made a place where people seem to genuinely like to work, where a lot of people have been able to develop their careers in different ways. Different people have different career goals, and you need to realize, as someone leading a company, that not everyone at this company is like yourself; they have different goals that they want to accomplish. So that problem of growing the company is just as important, and just as complex, as solving the technical problems and growing the product and growing the sales side and helping people find out that you have this great product that they should probably be using. So I think that has been a real lesson for me over the last seven years that we've been doing this.
Tobias Macey
0:41:11
Now, for the future of Fivetran, what do you have planned, both on the business roadmap as well as the feature sets that you're looking to integrate into Fivetran, and just some of the overall goals that you have for the business as you look forward?
George Fraser
0:41:12
Sure. So some of the most important stuff we're doing right now is on the sales and marketing side. We have done all of this work to solve this replication problem, which is very fundamental and very reusable, and I like to say no one else should have to deal with all of these APIs. Since we have done it, you should not need to write a bunch of Python scripts to sync your data, or configure Informatica, or anything like that. We've done it once so that you don't have to, and I guarantee you it will cost you less to buy Fivetran than to have your own team building an in-house data pipeline. So we're doing a lot of work on the sales and marketing side just to get the word out that Fivetran is out there, and that it might be something that's really useful to you. On the product side, we are doing a lot of work now in helping people manage those transformations in the data warehouse. We have the first version of our transformations tool in the product, and there's going to be a lot more sophistication added to that over the next year. We really view that as the next frontier for Fivetran: helping people manage the data after we've replicated it.
Tobias Macey
0:42:17
Are there any other aspects of the Fivetran company and technical stack, or the overall problem space of data synchronization, that we didn't touch on that you'd like to cover before we close out the show?
George Fraser
0:42:28
I don't think so. I think the thing that people tend not to realize, because they tend to just not talk about it as much, is that the real difficulty in this space is all of that incidental complexity of all the data sources. You know, Kafka is not going to solve this problem for you. Spark is not going to solve this problem for you. There is no fancy technical solution. Most of the difficulty of the data centralization problem is just in understanding and working around all of the incidental complexity of all these data sources.
Tobias Macey
0:42:58
For anybody who wants to get in touch with you or follow along with the work that you and Fivetran are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
George Fraser
0:43:15
Yeah, I think that the biggest gap right now is in the tools that are available to analysts who are trying to curate the data after it arrives. Writing all the SQL that curates the data into a format that's ready for the business users to attack with BI tools is a huge amount of work, and it remains a huge amount of work. If you look at the workflow of the typical analyst, they're writing a ton of SQL, and it's a very analogous problem to a developer writing code in Java or C#, but the tools that analysts have to work with look like the tools developers had in, like, the 80s. I mean, they don't even really have autocomplete. So I think that is a really under-invested problem, just the tooling for analysts to make them more productive, in the exact same way as we've been building tooling for developers over the last 30 years. A lot of that needs to happen for analysts too, and I think it hasn't happened yet.
Tobias Macey
0:44:13
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Fivetran and some of the insights that you've gained in the process. It's definitely an interesting platform and an interesting problem space, and I can see that you're providing a lot of value. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
George Fraser
0:44:31
Thanks for having me on.

Simplifying Data Integration Through Eventual Connectivity - Episode 91

Summary

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
  • What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
  • In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
    • How do different implementations of graph databases impact their viability for this use case?
  • Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
  • How much up-front modeling is necessary to make this a viable approach to data integration?
  • How do the volume and format of the source data impact the technology and architecture decisions that you would make?
  • What are the limitations or edge cases that you have found when using this pattern?
  • In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
  • What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. To grow your professional network and find opportunities within the startups that are changing the world, AngelList is the place to go; go to dataengineeringpodcast.com/angel to sign up today. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL. And just as a full disclosure, Tim is the CEO of CluedIn, which is a sponsor of the podcast. So Tim, can you just start by introducing yourself?
Tim Ward
0:02:09
Yeah, sure. My name is Tim Ward. As Tobias said, I'm the CEO of a data platform called CluedIn. I'm based out of Copenhagen, Denmark. I have with me my wife, my little boy Finn, and a little dog that looks like an Ewok, called Seymour.
Tobias Macey
0:02:29
And do you remember how you first got involved in the area of data management?
Unknown
0:02:32
Yeah, so I mean, I'm, I guess, a classically trained software engineer. I've been working in the software space for around 13 or 14 years now, predominantly in the web space, but mostly for enterprise businesses. And around, I don't know, maybe six or seven years ago, I was given a project in the space of what's called multivariate testing. It's the idea that if you've got a website, maybe the homepage, and you make some changes or different variations, which variation works better for the amount of traffic that you're wanting to attract, or maybe the amount of purchases made on the website? So that was my first foray into, okay, this involves me having to capture analytics data. That then took me down this rabbit hole of realizing, got it, I have to not only get the analytics from the website, but I need to correlate this against, you know, back office systems, CRM systems, ERP systems, PIM systems. And I kind of realized, oh my God, this becomes quite tricky with the integration piece. And once I went down that rabbit hole, I realized, for me to actually start doing something with this data, I need to clean it, I need to normalize it. And, you know, basically I got to this point where I realized data is kind of a hard thing to work with; it's not something you can pick up and just start getting value out of straight away. So that's kind of what led me down the path, around four and a half, five years ago, of saying, you know what, I'm going to get into this data space. And ever since then, I've just enjoyed immensely being able to help large enterprises become a lot more data driven.
Tobias Macey
0:04:31
And so to frame the discussion a bit, I'm wondering if you can just start by discussing some of the challenges and shortcomings that you have seen in the existing practices of ETL.
Unknown
0:04:42
Yes, sure. I mean, I guess I want to start by not trying to be that grumpy old man that's yelling at all technologies. One thing I've learned in my career is that it's very rare that a particular technology or approach is right or wrong; it's just right for the right use case. And I mean, you're also seeing a lot more patterns in integration emerge. Of course, we've got ETL, which has been around forever; you've got this ELT approach, which has been emerging over the last few years; and then you've seen streaming platforms also take up the idea of joining streams of data instead of something that is done up front. And, you know, to be honest, I've always wondered, with ETL, how on earth are people achieving this for an entire company? ETL for me has always been something where, if you've got two, three, four tools to integrate, it's a fantastic approach, right? But, you know, now we're definitely dealing with a lot more data sources, and the demand for having free-flowing data available is becoming much more apparent. It got to the point where I thought, am I the stupid one? If I have to use ETL to integrate data from multiple sources, as soon as we go over a certain number of data sources, the problem just becomes exponentially harder. And I think the thing that I found interesting as well with this ETL approach is that typically the data was processed through these classic, you know, designers, workflow DAGs, you know, directed acyclic graphs, and the output of this process was typically, oh, I'm going to store this in a relational database. And therefore, you know, I can understand why ETL existed. If you know what you're going to do with your data after this ETL process, and classically you would go into something like a data warehouse, I can see why that existed. But I think there are just different types of demands in the market today. There's much more need for, you know, flexibility and access to data, and not necessarily data that's been modeled as rigidly as you get in the kind of classical data warehouses. And I kind of thought, well, the relational database is not the only database available to us as engineers, and one of the ones that I've been focusing on for the last few years is the graph database. When you think about it, most problems that we're trying to solve in the modeling world today are actually a network, they are a graph; they're not necessarily a relational or a kind of flat document store. So I thought, you know, this seems more like the right store to be able to model the data.
And I mean, I think the second thing was that, just from being hands on, I found that this ETL process meant that when I was trying to solve problems and integrate data up front, I had to know what all the business rules were that dictated how the systems integrate, and what dictated clean data. And you're probably used to these ETL designers, Tobias, where I get these built-in functionalities to do things like, you know, trim whitespace and tokenize the text and things like that, and you think, yes, but I need to know up front what is considered a bad ID or a bad record. You're probably also used to seeing, you know, we've got these IDs, and sometimes it's a beautiful looking ID, and sometimes it's negative one, or NA, or, you know, placeholder, or a hyphen, and you think, I've got to, up front in the ETL world, define what all those possibilities are before I run my ETL job. And I just found this quite rigid in its approach. And I think the kind of game changer for me was that when I was using ETL and these classic designers to integrate more than five systems, I realized how long it took up front, that I needed to go around the different parts of the business and have them explain, okay, so how does the Salesforce lead table connect to the Marketo lead table? Like, how does it do that? And then time after time, after, you know, weeks of investigation, I would realize, oh, I have to jump to the, I don't know, the Exchange server or the Active Directory to get the information that I need to join those two systems. And it just resulted in this spaghetti of point-to-point integrations. And I think that's one of the key things that ETL suffers from: it puts us in an architectural design thinking pattern of, oh, how am I going to map systems point to point? And I can tell you, after working in this industry for five years so far, that systems don't naturally blend together point to point.
Tobias Macey
0:10:04
Yeah, your point about needing to understand all the possible representations of a null value means that in order for a pipeline to be sufficiently robust, you have to have a fair amount of quality testing built in, to make sure that any values coming through the system map to the existing values that you're expecting, and then be able to raise an alert when you see something that's outside of those bounds so that you can go ahead and fix it. And then being able to have some sort of a dead letter queue or bad data queue for holding those records until you get to a point where you can reprocess them, and then being able to go through and backfill the data. So it definitely is something that requires a lot of engineering effort in order to have something that is functional for all the potential values. And then there's also the aspect of schema evolution, and being able to figure out how to propagate that through your system and have your logic flexible enough to handle different schema values for cases where you have data flowing through that is at the transition boundary between the two schemas. So certainly a complicated issue. And so you recently released a white paper discussing some of your ideas on this concept of eventual connectivity, and I'm wondering if you can describe your thoughts on that and touch on how it addresses some of the issues that you've seen with the more traditional ETL pattern.
Unknown
0:11:38
Yeah, sure. I mean, there are a couple of fundamental things to understand behind this pattern that we've named eventual connectivity. First of all, it's a pattern that essentially embraces the idea that we can throw data into a store, and as we continue to throw in more data, records will figure out for themselves how to be merged. It's the idea of being able to place records into this kind of central repository with little hooks, little flags that indicate, hey, I'm a record, and here are my unique references. Obviously, the idea is that as we bring in more systems, those other records will say, hey, I actually have the same ID. Now, that might not happen up front; it might be after you've integrated systems one, two, three, four, five, six that systems two and four are able to say, hey, I now have the missing pieces to be able to merge our records. So in an eventual connectivity world, the advantage this really brings is that, first of all, if I'm trying to integrate systems, I only need to take one system at a time. I found it rare in the enterprise that I could find someone who understood the domain knowledge behind their Salesforce account and their Oracle account and their Marketo account; I would often run into someone who completely understood the business domain behind the Salesforce account. The reason I'm using that as an example is because Salesforce is a system where you can do anything in it; you can add objects that are, you know, animals or dinosaurs, not just the ones that come out of the box. I don't know who's selling to dinosaurs. But essentially, what this allows me to do is, when I walk into an integration job and the business says, hey, we have three systems, I say, got it. And if they say, oh, sorry, that was actually 300 systems, I go, got it, it makes no difference to me; it's only a time-based thing. The complexity doesn't grow because of the type of pattern that we're taking. And I'll explain the pattern. Essentially, you can conceptualize it as going through a system a record at a time, or an object at a time; let's take something like leads or contacts. The pattern basically asks us to highlight what the unique references to that object are. In the case of a person, it might be something like a passport number, it might be, you know, a local personal identifier. In Denmark, we have what's called the CPR number, which is a unique reference to me; no one else in Denmark has the same number. Then you get to things like emails, and what you discover pretty quickly in the enterprise data world is that an email is in no way a unique identifier of an object, right? We can have group emails that refer to multiple different people, and, you know, not all systems will specify whether this is a group email or an email referring to an individual. So the pattern asks us, or dictates to us, to mark those references as aliases, something that could allude to a unique identifier of an object. And then we get to the referential pieces. Imagine that we have a contact that's associated with a company; you could probably imagine that as a column in the contact table that's called company ID.
And the key thing with the eventual connectivity pattern is that, although I want to highlight that as a unique reference to another object, I don't want to tell the integration pattern where that object exists. I don't want to tell it that it's in the Salesforce organization table, because, to be honest, if that's a unique reference, that unique reference may exist in other systems. So what this means is that I can take an individual system at a time and not have to have this standard point-to-point type of relationship between data. And if I was to highlight three main wins that you get out of this, I think the first is that it's quite powerful to walk into a large business and say, hey, how many systems do you have? Well, we have a thousand. And I think, good, when can we start? Now, if I was taking the ETL approach, I would be thinking, oh God, can we actually honestly do this?
0:16:40
As you probably know yourself, Tobias, often we go into projects with big smiling faces, and then when you see the data, you realize, oh, this is going to be a difficult project. So there's that advantage of being able to walk in and say, I don't care how many systems you have; it makes not a lot of difference in complexity to me. I think the other piece is that the eventual connectivity pattern addresses this idea that you don't need to know all the business rules up front, of how systems connect to each other and what's considered bad data versus good data. Rather, you know, we let things happen and we take a much more reactive approach to rectifying them. And I think this is more representative of the world we live in today. Companies are wanting more real-time data for their consumers, or for the consumption technologies where we get the value, things like business intelligence, etc. And they're not willing to put up with these kind of rigid approaches of, oh, the ETL process has broken down, I need to go back to our designer, I need to update that and run it through and make sure that we guarantee the data is in the perfect order before we actually do the merging. I think the final thing that has become obvious, time after time, where I've seen companies use this pattern, is that eventual connectivity will discover joins where it's really hard for you and me to sit down and figure out where those joins are. And I think it comes back to this core idea that systems don't connect well point to point; there's not always a nice, ubiquitous ID that we can just use to join two systems together. Often we have to jump between different data sources to be able to wrangle this into a unified set of data. Now, at the same time, I can't deny that there's quite a lot of work going on in the field of, you know, ETL. You've got platforms like NiFi and Airflow, and you know what, those are still very valid. They're very good at moving data around, and they're fantastic at breaking down a workflow into discrete components that can, in some cases, run independently. But the eventual connectivity pattern, for us, time after time, has allowed us to blend systems together without this overhead of complexity. And Tobias, there's not a big enough whiteboard in the world when it comes to integrating, you know, 50 systems. You just have to put yourself in that situation and realize, oh wow, the old ways of doing it are just not scaling.
Tobias Macey
0:19:31
And as you're talking through this idea of eventual connectivity, I'm wondering how it ends up being materially different from a data lake, where you're able to just do the more ELT pattern of shipping all of the data into a repository without having to worry about modeling it up front or understanding what all the mappings are, and then doing some exploratory analysis after the fact to create all of these connection points between the different data sources and do whatever cleaning happens afterward.
Unknown
0:20:03
Yeah, I mean, one thing I've gathered in my career as well is that, you know, something like an overall data pipeline for a business is going to be made up of so many different components. And in our world, in the eventual connectivity world, the lake still makes complete sense to have. I see the lake as a place to dump data where I can read it in a ubiquitous language; in most cases it's SQL that's exposed, and, you know, I don't know a single person in our industry that doesn't know SQL to some extent. So it is fantastic to have that lake there. Where I see the problem often evolving is that the lake is obviously kind of a place where we would typically store raw data. It's where we abstract away the complexity of, oh, if I need data from a SharePoint site, I have to learn the SharePoint API. No, the lake is there to basically say, that's already been done, I'm going to give you SQL, that's the way you're going to get this data. What I find when I look at the companies that we work with is that, yes, but there's a lot that needs to be done from the lake to where we can actually get the value. I think something like machine learning is a good example. Time after time we hear, and it's true, that machine learning doesn't really work that well if you're not working with good quality, well integrated data that is complete, that isn't missing values, you know, nulls and empty columns and things like that. And what I found is that we went through this period in our industry where we said, okay, well, the lake is going to give the data science teams and the different teams direct access to the raw data. And what we found is that every time they tried to use that data, they went through the common practices of, now I need to blend it, now I need to catalog it, now I need to normalize it and clean it. And you can see that the eventual connectivity pattern is there to say, no, no, this is something that sits in between, that matures the data from the lake to the point where it's already blended. And that's one of the biggest challenges I see there: you know, if I get a couple of different files out of the lake, and then I go to investigate how they join together, I still have this experience of, oh, this doesn't easily blend together. So then I go on this exploratory discovery phase of, what other datasets do I have to use to string these two systems together? And we at CluedIn would just like to eliminate that.
Tobias Macey
0:22:46
So to make this a bit more concrete for people who are listening and wondering how they can put this pattern into effect in their own systems, can you talk through an example system architecture and data flow for a use case that you have built or at least experimented with, how the overall architecture plays together to make this a viable approach, and how those different connection points between the data systems end up manifesting?
Unknown
0:23:17
Yeah, definitely. So maybe it's good to use an example. Imagine you have three or four data sources that you need to blend; you need to ingest them, you then usually need to merge the records into kind of a flat, unified data set, and then you need to, you know, push this somewhere. That might be a data warehouse, something like BigQuery or Redshift. And the fact is that, in today's world, that data also needs to be available for the data science team, and now it needs to be available for things like exploratory business intelligence. So when you're building your integrations, I think architecturally, from a modeling perspective, the three things that you need to think about are what we call entity codes, aliases, and edges, and those three pieces together are what we need to be able to map this properly into a graph store. Simply put, an entity code is kind of an absolute unique reference to an object, as I alluded to before, something like a passport number. That's a unique reference to an object, but by itself, just the raw number doesn't necessarily mean it's unique across all of the systems that you have at your workplace.
0:24:42
The other is aliases. Aliases are things like an email, a phone number, a nickname. They're all alluding to some type of overlap between these records, but they're not something that we can honestly go ahead and do a hundred percent merge of records based on. Now, of course, to handle that you then need to investigate things like inference engines, to build up, you know, confidence: how confident can I be that a person's nickname is unique with reference to the data that we've pulled in from these three or four data sources that I'm talking about? And then finally there are the edges, which are there to build referential integrity. What I find architecturally is that when we're trying to solve data problems for companies, the majority of the time their model represents much more of a network than the classic relational store, or columnar database, or document store. And so when we look at the technology that's needed to support the system architecture, one of the key technologies at the heart of this is a graph database. And to be honest, it doesn't really matter which graph database you use, but what we found is important is that it needs to be a native graph store. There are triple stores out there, there are multi-model databases like Cosmos DB and SAP HANA, but what we found is that you really need a native graph to do this properly. So the way that you can conceptualize the pattern is that every record that we pull in from a system, or that you import, goes into this network or graph as a node, and every entity code for that record, i.e. the unique ID, or multiple unique IDs of that record, also goes in as a node connected to that record. Now, every alias goes in as a property on that original node, because we probably want to do some processing later to figure out if we can turn that alias into one of these entity codes, these unique references. And here's the interesting part, this is where the eventual connectivity pattern kicks in: all the edges, say if I was referencing a person to a company, that this person works at a company, those edges are placed into the graph, and a node is created, but it's marked as hidden. We call those shadow nodes. So you can imagine, if we brought in a record on Barack Obama, and it had Barack's phone number, now, that's not a unique reference, but what we would do is create a node in the graph referring to that phone number, link it to Obama, but mark the phone number node as hidden; as I said before, we call these shadow nodes. And essentially, you can see that as one of these hooks: ah, if I ever get other records that come in later that also have an alias or an entity code that overlaps, that's where I might need to start doing my merging work. And what we're hoping for, and this is what we see time after time as well, is that as we import system one's data, it'll start to come in and you'll see a lot of nodes that are shadow nodes, i.e. I have nothing to hook onto on the other side of this ID. And the analogy that we use for this shadow node is that, you know, records come in, and by default they're a clue. A clue is in no way factual; in no way do we have any other records corroborating the same values. And our goal, in this eventual connectivity pattern, is to turn clues into facts.
And what makes facts is records that have the same entity code existing across different systems. So the architectural key to this pattern is that a graph store needs to be there to model our data, and here's one of the key reasons. If the landing zone for this integrated data was a relational database, I would need an upfront schema; I would need to specify how these objects connect to each other. What I've always found in the past is that when I need to do this, it becomes quite rigid. Now, I'm a strong believer that every database needs a schema at some point, or you can't scale these things. But what's nice about the graph is that one of the things graph databases got really right is flexible data modeling. There is no necessarily more important object within the graph structure; they're all equal in their complexity, but also in their importance. And you can really pick and choose the graph database that you want, but it's one of the keys to this architectural pattern.
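To ground the entity code, alias, and shadow node vocabulary, here is a simplified sketch using plain Python dictionaries in place of a native graph store. The record shapes and identifiers are invented for illustration; this is one reading of the pattern Tim describes, not CluedIn's implementation.

```python
# A simplified sketch of the eventual connectivity bookkeeping described above,
# using plain dictionaries instead of a native graph store. Entity codes are
# absolute identifiers, aliases are weaker hints kept as properties, and edges
# that point at IDs we have not seen yet become hidden "shadow nodes".

nodes = {}        # node_id -> {"properties": ..., "aliases": ..., "shadow": bool}
code_index = {}   # entity code -> node_id
edges = []        # (from_node, to_node, label)

def ingest(source, record_id, properties, entity_codes, aliases, references):
    node_id = f"{source}/{record_id}"
    # If another system already claimed one of these entity codes, merge into it.
    for code in entity_codes:
        if code in code_index:
            node_id = code_index[code]
            nodes[node_id]["shadow"] = False          # a clue becomes a fact
            nodes[node_id]["properties"].update(properties)
            break
    else:
        nodes[node_id] = {"properties": dict(properties), "aliases": set(), "shadow": False}
    nodes[node_id]["aliases"].update(aliases)
    for code in entity_codes:
        code_index[code] = node_id
    # References become edges to shadow nodes until the real record arrives.
    for label, code in references:
        target = code_index.get(code)
        if target is None:
            target = code
            nodes[target] = {"properties": {}, "aliases": set(), "shadow": True}
            code_index[code] = target
        edges.append((node_id, target, label))


ingest("crm", "c1", {"name": "Ada"}, ["passport:P123"], ["ada@corp.com"],
       [("works_at", "company:ACME")])
ingest("erp", "e9", {"name": "ACME Inc"}, ["company:ACME"], [], [])
print(nodes["company:ACME"]["shadow"])  # False: the shadow node has been resolved
```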
Tobias Macey
0:30:07
One of the things that you talked about in there is the fact that there's this flexible data model, and so I'm wondering what type of upfront modeling is necessary in order to make this approach viable. I know you talked about the idea of the entity codes and the aliases, but for somebody who is taking a source data system and trying to load it into this graph database in order to take advantage of this eventual connectivity pattern, what is the actual process of loading that information in and assigning the appropriate attributes to the different records and to the different attributes in the record? And then also, I'm wondering if there are any limitations in terms of what the source format looks like, as far as the serialization format or the types of data that this approach is viable for.
Unknown
0:31:03
Sure. Good question. So I think the first thing is to identify that with the eventual connectivity pattern and modeling it in the graph, the key is that there will always be extra modeling that you do after this step. And the reason why is that, if you think about the data structures we have as engineers, the network or the graph is the highest fidelity data structure we have. It's a higher, more detailed structure than a tree, it's more structured than a hierarchy or a relational store, and it definitely has more structure or fidelity than something like a document. With this in mind, we use eventual connectivity to solve this piece of integrating data from different systems and modeling it, but we know we will always do better, purpose-fit modeling later. So it's worth highlighting that the value of the eventual connectivity pattern is that it makes the integration of data easier, but this will definitely not be the last modeling that you do, and therefore it allows flexible modeling. You always know, hey, if I'm trying to build a data warehouse based off the data that we've modeled in the graph, you're always going to run extra processes after it to model it into probably a relational store for a data warehouse, or a columnar store; you're going to model it purpose-fit to solve that problem. However, if what you're trying to achieve with your data is flexible access to feed other systems, you want the highest fidelity and you want the highest flexibility in modeling. The key is that if you were to drive your data warehouse directly off this graph, it would do a terrible job; that's not what the graph was purpose-built for. The graph was always good at flexible data modeling, and it's always good at being able to join records very fast, just as fast as doing an index lookup; that's how these native graph stores have been designed. So it's important to highlight that there's really not a lot of upfront modeling. Of course, we shouldn't do silly things, but I'll give you an example. If I was modeling a skill, a person, and a company, it's completely fine to have a graph where the skill points to the person and the person points to the organization, and it's also okay to have the person point to the skill and the skill point to the organization. That's not as important. What's important at this stage is that the eventual connectivity pattern allows us to integrate data more easily. Now, when I get to the point where I want to do something with that data, I might find that, yes, I actually do need an organization table which has a foreign key to person, and then person has a foreign key to skill, and that's because, you know, that's typically what a data warehouse is built to do. It's there to model the data perfectly, so that if I have a billion rows of data, this report still runs fast. But we lose that kind of
0:34:34
that flexibility in the data modeling. Now, as for formats and things like that, what I've found is that, to some degree, the formatting in the source data, as you could probably imagine, is not created equal. Many different systems will allow you to do absolutely anything you want, and where the classic ETL approach dictates that you capture these exceptions up front, of, if I've got certain-looking data coming in, how does it connect to the other systems, what eventual connectivity does is just catch them later in the process. My thought on this is that, to be honest, you will never know all of these business rules up front, and therefore let's embrace an integration pattern that says: hey, if the schema in the source or the format of the data changes, and you kind of alluded to this before as well, Tobias, okay, got it, I want to be alerted that there is an issue with deserializing this data, I want to start queuing up the data in a message queue or maybe a stream, and I want to be able to fix that schema and have a platform that can say, got it, now that that's fixed, I'll continue on with serializing the things that will now serialize. These kinds of things will happen all the time. I think I've referred to it before, and heard other people refer to it, as schema drift, and it will always happen in the source and in the target. So what I've found success with is embracing patterns where failure is going to happen all the time. When we look at the ETL approach, it's much more of a "when things go wrong, everything stops," right? The different workflow stages that we've put into our classical ETL designers all go red, red, red, red: I have no idea what to do, and I'm just going to fall over. What we would rather have is a pattern that says: got it, the schema has changed, I'm going to queue up what you need until the point where you've changed that schema, and once that's in place it will test it and say, yeah, that schema looks right, I can serialize that now, I'll continue on. And what I find is that if you don't embrace this, you spend most of your time just reprocessing data through ETLs.
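To make the schema-drift handling described above concrete, here is a minimal sketch, assuming a simple JSON feed; the names (SCHEMA, dead_letter, parse_record) are illustrative rather than anything from CluedIn's actual platform. Records that fail to deserialize are parked in a queue instead of halting the pipeline, and are replayed once the schema is fixed:

```python
import json
from collections import deque

# Hypothetical target schema: field name -> type we expect after deserialization.
SCHEMA = {"id": str, "name": str, "revenue": float}

dead_letter = deque()  # records we could not deserialize yet (schema drift)


def parse_record(raw: str, schema: dict):
    """Deserialize one JSON record and validate it against the schema.

    Raises ValueError if the record does not match, so the caller can
    park it instead of stopping the whole pipeline.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    for field, expected_type in schema.items():
        if field not in data or not isinstance(data[field], expected_type):
            raise ValueError(f"schema drift on field '{field}'")
    return data


def ingest(raw_records, schema):
    """Process what we can now; queue the rest for replay after a schema fix."""
    clean = []
    for raw in raw_records:
        try:
            clean.append(parse_record(raw, schema))
        except ValueError:
            dead_letter.append(raw)  # alert + park, don't stop the pipeline
    return clean


def replay_after_fix(schema):
    """Once the schema is corrected, drain the queue and try again."""
    pending = list(dead_letter)
    dead_letter.clear()
    return ingest(pending, schema)
```

A real platform would put the parked records on a durable queue or stream rather than in memory, but the control flow is the same: alert, park, fix, replay.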
Tobias Macey
0:37:13
And so it seems that there actually is still a place for workflow engines, or some measure of ETL, where you're extracting the data from the source systems. But rather than loading it into your data lake or your data warehouse, you're adding it to the graph store in order to pull these mappings out, and then also potentially going from the graph database, where you have joined the data together, and pulling it out from there using some sort of query so you have the mapped data extracted, and then loading that into your eventual target.
Unknown
0:37:49
I mean, what you've just described there is a workflow, and therefore the workflow systems still make sense. It's very logical to look at these workflows and say, oh, that happens, then that happens, then that happens. They completely still make sense, and in some cases I actually still use the ETL tools for very specific jobs. But if we were to use these kind of classical workflow systems, the eventual connectivity pattern, as you described it, is just one step in that overall flow. What I've found over time is that we used to use these workflow systems to join data, and I would actually rather hand that off to an individual step called eventual connectivity, where it does the joining and things like that for me, and then continue on from there. Very similar to the example you gave, and as I've also been mentioning here, there will always be things you do after the graph, and that is something you could easily push into one of these workflow designers. Now, as for an example of the times when our company still uses these tools with our customers, I think one of the ones that makes complete sense is IoT data. And that's mainly because it's not typical, at least in the cases that we've seen, that there's as much hassle in blending and cleaning that data. We see that more with operational data, things like transactions, customer data, and customer records, which are typically quite hard to blend. When it comes to IoT data, if there's something wrong with the data such that it can't blend, it's often that, well, maybe it's a bad reader that we're reading off, rather than data that is actually dirty. Of course, every now and then, if you've worked in that space, you'll realize that readers can drop off the network and you can end up with holes in the data. But eventual connectivity would not solve that either, right? Typically, in those cases, you'll do things like impute the values from historical and future data to fill in the gaps, and it's always a little bit of a guess; that's why it's called imputing. But, to be honest, if it was my task to build a unified data set from across different sources, I would choose this eventual connectivity pattern every single time. If it was a workflow of data processing where I know the data blends easily and there's not a data quality issue, fine. But where you need to jump across multiple different systems to merge data, I have just found, time after time, that these workflow systems reach their limit and it becomes too complex.
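As a concrete illustration of the imputation approach mentioned for gappy IoT readings, here is a minimal sketch under the assumption of an evenly spaced sensor series; the function name is illustrative and this is not CluedIn code:

```python
def impute_gaps(series):
    """Fill None gaps in an evenly spaced sensor series by linear interpolation
    between the nearest known readings; edges carry the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled[i] = series[nxt]          # no earlier reading: carry backward
        elif nxt is None:
            filled[i] = series[prev]         # no later reading: carry forward
        else:
            weight = (i - prev) / (nxt - prev)
            filled[i] = series[prev] + weight * (series[nxt] - series[prev])
    return filled


readings = [10.0, None, None, 16.0, None]    # a reader dropped off the network
print(impute_gaps(readings))                 # [10.0, 12.0, 14.0, 16.0, 16.0]
```

It remains a guess, as noted above, which is why imputed values are usually flagged rather than treated as observed readings.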
Tobias Macey
0:40:53
And for certain scales or varieties of data, I imagine that there are challenges that come up when trying to load everything into the graph store. So I'm wondering what you have run up against as far as limitations to this pattern, or at least alterations to it, to be able to handle some of these larger volume tasks.
Unknown
0:41:15
I'll start with this: the graph is notoriously hard to scale. With most of the graph databases that I've had experience with, you're essentially bound to one big graph; there's no real notion of clustering these data stores into several graphs that you could query across. So scaling is actually quite hard to start with. But as for limitations of the pattern itself, there are many. It starts with the fact that you need to be careful. I'll give you a good example: I've seen many companies that use this pattern flag something like an email as unique, and then we realize later on, no, it's not, and we have merged records that are not duplicates. This means, of course, that you need support in the platform you're utilizing for splitting these records, fixing them, and reprocessing them at a later point. These are also things that would be very hard to pick up in ETL-style batches. I think one of the other downsides of this approach is that, up front, you don't know how many records will join. Like the name alludes to, eventually you'll get joins, or connectivity, and this pattern will decide how many records it joins for you, based off these entity codes or unique references, or the power of your inference engine when it comes to things that are a little bit fuzzy, a fuzzy ID for someone, things like phone numbers. The great thing about this is it also means that you don't need to choose what type of join you're doing. In the relational world, you've got plenty of different types of joins: inner joins, outer joins, left and right outer joins, things like this. In the graph, there's one join, right? So with this pattern, it's not like you can pick the wrong join to go with; there's only one type. So it really becomes useful when, no, I'm just trying to merge these records, I don't need to hand-hold how the joins will happen. Another downside that I've had experience with is that, let's just say you have system one and system two. What you'll often find when you integrate these two systems is that you have a lot of these shadow nodes in the graph, i.e., what we sometimes call floating edges, where, hey, I've got a reference to a company with an ID of 123, but I've never found the record on the other side with the same ID. So, in fact, I'm storing lots of extra information that I'm not actually utilizing. But the advantage is that you will integrate system four, you will integrate system five, where that data sits, and the value is that you don't need to tell the system how they join; you just need to flag these unique references. And I think the final limitation that I've found with these patterns is that you learn pretty quickly, as I alluded to before, that there are many records in your data sources where you think a column is unique, but it's not
Tim Ward
0:44:45
it might be unique in your system,
Unknown
0:44:47
i.e., in Salesforce the ID is unique, but if you actually look across the other parts of the stack, you realize, no, there is another company in another system with a record called 123, and they have nothing to do with each other. So these entity codes that I've been talking about are made up of multiple different parts: they're made up of the ID, 123; they're made up of a type, something like organization; and they're made up of a source of origin, say Salesforce account 456. What this does is guarantee uniqueness if you add in two Salesforce accounts, or if you add in systems that have the same ID but coming from a different source. And as I said before, a good example would be the email. Even at our company, we use GitHub Enterprise to store our source code, and we have notifications that our engineers get when there are pull requests and things like this. GitHub actually identifies each employee as notifications@github.com; that's what that record sends us as its unique reference. And of course, if I marked this as a unique reference, all of the employee records using this pattern would merge. However, what I like about this approach is that at least I'm given the tools to rectify the bad data when I see it. And to be honest, if companies want to become much more data driven, which is what we aim to help our customers with, I just believe it means we have to shift, or learn to accept, that there's a bit more risk involved. But is the value of having data more readily available at the forefront worth more than the old approaches that we were taking?
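To make the entity-code idea concrete, here is a minimal sketch, not CluedIn's actual API, of how a composite code of type, origin, and ID can drive merging without accidentally collapsing two unrelated records that happen to share a raw ID:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityCode:
    """Composite key: the raw ID alone is not unique, the triple is."""
    entity_type: str   # e.g. "organization"
    origin: str        # e.g. "salesforce/account-456"
    value: str         # e.g. "123"


@dataclass
class Record:
    attributes: dict
    codes: set          # set of EntityCode claimed by this record


def eventually_connect(records):
    """Merge records that share at least one entity code.

    Records that share no code stay separate until a later source
    arrives that links them -- hence 'eventual' connectivity.
    """
    clusters = []
    for rec in records:
        matches = [c for c in clusters if c.codes & rec.codes]
        merged = Record(attributes=dict(rec.attributes), codes=set(rec.codes))
        for m in matches:
            merged.attributes.update(m.attributes)
            merged.codes |= m.codes
            clusters.remove(m)
        clusters.append(merged)
    return clusters


# Two systems use the ID "123" for unrelated companies; the origin keeps them apart.
crm = Record({"name": "Acme Inc"}, {EntityCode("organization", "salesforce/account-456", "123")})
erp = Record({"name": "Globex"}, {EntityCode("organization", "sap/client-9", "123")})
print(len(eventually_connect([crm, erp])))  # 2 -- no accidental merge
```

The same mechanism merges records as soon as two sources do share a code, which is the "one join type" behavior described above.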
Tobias Macey
0:46:46
And for anybody who wants to dig deeper into this idea, or learn more about your thoughts on it or some of the adjacent technologies, what are some of the resources that you recommend they look to?
Unknown
0:47:00
So I guess the first thing, and Tobias, you and I have talked about this before, is that the way to learn more about it is to get in contact and challenge us on this idea. When you see a technology and you're an engineer and you go out and start using it, you have this tendency to gain a bias around it; time after time you see it working, and then you think, why isn't everybody else doing this? And actually, the answer is quite clear: it's because things like graph databases were not as ubiquitous as they are right now. You can get off-the-shelf, free graph databases to use today, and even ten years ago this was just not the case; you would have had to build these things yourself. So the first thing is, you can get in touch with me at tiw@cluedin.com if you're interested in challenging this design pattern and really getting to the crux of whether this is something we can replace ETL with completely. The other thing is the white paper that you alluded to; that's available from our website, so you can always jump over to cluedin.com to read it, and it's completely open and free for everyone. We also have a couple of YouTube videos; if you just search for CluedIn, I'm sure you'll find them, and in those we talk in depth about utilizing the graph to merge different data sets. But I always like to talk to other engineers and have them challenge me on this, so feel free to get in touch. And if you're wanting to learn much more, we also have the developer training that we give here at CluedIn, where we compare this pattern to the other integration patterns that are out there, and you can get hands-on experience taking different data sources, trying the multiple different approaches, and really just seeing the one that works for you.
Tobias Macey
0:49:04
Is there anything else about the idea of eventual connectivity, the ETL patterns that you have seen, or the overall space of data integration that we didn't discuss yet that you'd like to cover before we close out the show?
Unknown
0:49:16
I think for me, I always like it when I have more engineering patterns and tools on my tool belt. So the thing for listeners to take from this is: use it as an extra little piece on your tool belt. If you find that you walk into a company that you're helping, and they say, hey, listen, we're really wanting to start to do things with our data, we've got 300 systems, and to be honest I've been given the direction to pull and wrangle this into something we can use, really think about this eventual connectivity pattern and investigate it as a possible option. You'll be able to see how to implement the pattern in the white paper, but implementing it yourself is really not complex. Like I said before, one of the keys is just to embrace maybe a new database family to be able to model your data. And yes, get in touch if you need any more information.
Tobias Macey
0:50:22
And one follow-on from that, I think, is the idea of migrating from an existing ETL workflow into this eventual connectivity space. It seems that the logical step would be to just replace your current target system with the graph database, adding in the mapping for the entity IDs and the aliases. Then you're at least partway on your way to being able to take advantage of this, and you just add a new ETL or workflow at the other end to pull the connected data out into whatever your original target systems were. Yeah, exactly.
Unknown
0:51:02
I mean, quite often we walk into a business and they've already got many years of business logic embedded into these ETL pipelines. My idea on that is not to just throw those away; there's a lot of good stuff there. It's really just complementing them with this extra design pattern, which is probably a little bit better at the whole merging and deduplication of data.
Tobias Macey
0:51:29
All right. Well, for anybody who wants to get in touch with you, I'll add your email and whatever other contact information to the show notes, and I've also got a link to the white paper that you mentioned. As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Unknown
0:51:49
Well, this is a little bit off topic, but I would actually say that I'm amazed how many companies I walk into that don't know the quality of the data they are working with. So I think one of the big gaps that needs to be fixed in the data management market is the ability to integrate data from different sources and be explicitly told, via different metrics, the classic ones being accuracy and completeness and things like this, what they are dealing with. Just that simple fact of knowing, hey, we're dealing with 34% accurate data, and guess what, that's what we're pushing to the data warehouse to build reports from, and that's what our management is making key decisions off. So first of all, the gap is in knowing what quality of data you're dealing with. And I think the second piece is in facilitating the process around how you increase that. A lot of these things can often be fixed by normalizing values: if I've got two different names for a company, but they're the same record, which one do you choose? Do we normalize to the value that's uppercase, or lowercase, or title case, or the one that has "Incorporated" at the end? I think that part of the industry does need to get better.
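As a rough illustration of the kind of quality metrics and normalization being described, here is a minimal sketch; the metric definition and the naive normalization rule are assumptions for demonstration, not a description of any particular product's behavior:

```python
def completeness(records, required_fields):
    """Share of records that have a non-empty value for every required field."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return complete / len(records)


def normalize_company_name(candidates):
    """Pick one canonical spelling for the same company.

    A naive rule: prefer the longest candidate (it tends to keep suffixes
    like 'Incorporated') and render it in title case.
    """
    return max(candidates, key=len).title()


rows = [
    {"id": "1", "name": "ACME", "revenue": 10.0},
    {"id": "2", "name": "", "revenue": None},
]
print(completeness(rows, ["id", "name", "revenue"]))          # 0.5
print(normalize_company_name(["ACME", "Acme Incorporated"]))  # Acme Incorporated
```

The point is less the specific rules than surfacing the numbers, so the business knows what quality of data is feeding its reports.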
Tobias Macey
0:53:20
All right. Well, thank you very much for taking the time today to join me and discuss your thoughts on eventual connectivity and some of the ways that it can augment or replace some of the ETL patterns that we have been working with to date. I appreciate your thoughts on that, and I hope you enjoy the rest of your day.
Tim Ward
0:53:37
Thanks, Tobias.

Straining Your Data Lake Through A Data Mesh - Episode 90

Summary

The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to grow your professional network and find opportunities with the startups that are changing the world then Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?
    • What are some of the organizational and industry trends that tend to lead to this solution?
  • You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?
    • In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?
  • What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?
  • One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?
    • A corollary to the issue of discovery is that of access and governance. What are some strategies to making that scalable and maintainable across different data products within an organization?
      • Who is responsible for implementing and enforcing compliance regimes?
  • One of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. What has been your experience in that regard?
    • How do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?
      • Has latency of data retrieval proven to be an issue in your work?
  • When it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?
    • How do team structures and dynamics shift in this scenario?
    • What are the necessary skills for each team?
  • Who is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?
  • Is there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?
  • For an organization that has an existing monolithic architecture, how do you suggest they approach decomposing their data into separately managed domains?
  • Are there any other architectural considerations that data professionals should be considering that aren’t yet widespread?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:13
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go; go to dataengineeringpodcast.com/angel to sign up today. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management. So Zhamak, can you start by introducing yourself?
Zhamak Dehghani
0:02:05
Hi, Tobias. I am Zhamak Dehghani, and I am a technical director at Thoughtworks, a consulting company; we do software development for many clients across different industries. I'm also a part of our Technology Advisory Board, which gives me, I guess, the privilege and the vantage point to see all the projects and technology that we use globally across our clients, digest information from that, and publish it as part of our Technology Radar twice a year.
Tobias Macey
0:02:39
And do you remember how you got involved in the area of data management? Sure.
Unknown
0:02:42
So I guess, earlier in my career, I built systems where part of what they had to do, as part of the kind of network management systems many years back, was to
Zhamak Dehghani
0:02:57
collect,
Unknown
0:02:58
you know, real-time data from a set of devices across hundreds of nodes, and manage that data, digest it, and act on it as part of some sort of network management observability. However, after that experience, most of my focus has been on distributed architecture at scale, and mostly on operational systems. So data was something that was hidden inside operational services in a distributed system, there to support what those operational systems needed to do. In the last few years, both being on the tech advisory board for Thoughtworks and also working with clients on the west coast of the US, where I'm located, I've had the opportunity to work with teams that were building data platforms. I support different technical teams that we have at different clients; sometimes I go really deep with one client, and sometimes I come across multiple projects. So I came from slightly left field, from distributed systems and operational systems, to people who were struggling to build the next generation data platform for a large retailer, or teams building the next generation analytical platform for one of the tech giants down here in San Francisco and struggling to scale it. So in the last couple of years I started working with teams that were heavily involved in recovering from past failures in data warehousing and data platforms and building the next generation, and seeing the challenges and experiences that they were going through. I'm sorry, I can't name many of the clients that I work with, but they often fall into the category of large organizations with fairly rich domains and rich data sets, where the state of the data is relatively unpredictable or of poor quality, because the organizations have been around for a long time and have been trying to harness the data and use it for many years.
Tobias Macey
0:05:18
And so you recently wrote a post that details some of the challenges associated with data lakes and the monolithic nature of their architectural pattern, and proposes a new approach that you call a data mesh. So can you start by providing your definition of a data lake, since there are a lot of disagreements about that in the industry, and discussing some of the problems and challenges that they pose as an architectural and organizational pattern?
Unknown
0:05:47
Sure. So maybe I'll give you a little bit of a historical perspective on the definition, where it started, and what we see today as the general pattern. Data lake was a term coined by James Dixon in 2010, and what he envisaged at the time was an alternative to the data warehousing approach. What he suggested was that a data lake is a place for a single data source to provide its data in its raw format, so that other people, whoever the consumers are, can decide how they want to use it and how they want to model it. The contrast was with the data warehouse, in that the data warehouse was producing pristine, modeled data that was well modeled, had very strong schemas, and was designed to address very specific use cases and needs around business intelligence, reporting, and so on. A data lake, based on James Dixon's initial definition, was a place where a single data source provides unmodeled, raw data that other people can use. Later on, he actually corrected and enhanced that definition, saying what he intended a data lake to be was one place for raw data from a single data source, and that an aggregation of data from multiple sources into one place is what he called a water garden. But I think his thinking around raw data versus modeled data stirred people's imagination, and what the data lake turned out to be is a place that aggregates raw data from all of the data sources, all corners of the organization, into one place, to allow mostly analytical usage, with data scientists diving in to find out what they need to find out. And it has evolved even from then onwards. If you look, for example, at different data lake solution providers and their websites, whether they are the famous cloud providers like Azure and Google and so on, or other service providers, the current incarnation of a data lake is not only a place where raw data from all the sources in the organization lands, but also a single place where the pipelines to cleanse and transform and provide different views and different access on that data also exist. I use "data lake" somewhat metaphorically; I really mean the incarnation of the current data lake as a single, centralized, and kind of monolithic data platform to provide data for a variety of use cases, anything from business intelligence to analytics to machine learning, accumulating data from all the domains and all the sources in the organization. So based on that definition, we can, I guess, jump into the second part of your question about some of the underlying characteristics that these sorts of centralized data platforms share.
Tobias Macey
0:09:18
Yeah. So I'm interested in understanding some of the organizational and technical challenges that they pose, particularly in the current incarnation of how everybody understands a data lake as being essentially a dumping ground for raw data that you don't necessarily know what to do with yet, or that you just want to be able to archive essentially forever, and also some of the organizational and industry trends that have led us to this current situation.
Unknown
0:09:46
Yeah, absolutely. I'll go into the challenges in a minute, but I want to preface that with the litmus test that this article has given me to validate whether the challenges that I had seen were widely applicable globally or not. The challenges that I've observed, both technical and organizational, I observed working with a handful of clients, and also saw secondhand from the projects that we were running globally. However, since I've written the article, there are tens of calls that I've received from different people and different organizations sharing the challenges that they have and how much this resonated with them, and that was a confirmation that it wasn't just a problem in the pocket that I was exposed to; it's more widespread. I think some of the underlying challenges are related. So one is the symptoms that we see, and the symptoms are around, mostly,
0:11:07
how do I bootstrap building a data lake? It's a big part of your infrastructure architecture, a central piece of your architecture, and yet it needs to facilitate collaboration across such a diverse set of stakeholders in your organization: the people who are building your operational systems that collect the source data, essentially generating that source data as a byproduct of what they're doing. Those stakeholders really have no incentive to provide that byproduct, that data, for other people to consume, because there is no direct incentive or feedback cycle into where the sources are. So this central piece of architecture needs to collaborate with a diverse set of people who represent the source data, and also a diverse set of teams and business units who represent the consumers of the data and all the possible use cases you can imagine for a company to become data driven, essentially to make intelligent decisions based on the data. So the symptoms that we see are either that people are stuck bootstrapping and building such a monolith, creating all those points of integration and points of collaboration, or that people who have succeeded in creating some form of that data lake or data platform have failed to scale: either failing to onboard new data sources and deal with the proliferation of data sources in the organization, or failing to scale in responding to the consumption models, the different access points and consumption models for that data. So you end up with this centralized, rather stale, incomplete data set that really doesn't serve a diverse set of use cases; it might serve a narrow set of use cases. And the fourth failure symptom that I see is: okay, I've got a data platform in place, but how do I change my organization to actually work differently, use data for intelligent decision making, and augment applications to become smarter with that data? I'll put that fourth failure mode aside, because there are a lot of cultural issues associated with it, and focus on the architecture, although I know you can't really separate architecture from organization because they mirror each other. So what are the root causes? What are the fundamental characteristics that any centralized data solution, whether it's the warehouse, or the lake, or your next generation cloud-based data platform, shares that lead to those symptoms? I think the first and foremost is the assumption that data needs to be in one place, and when you put things in one place, you create one team around managing it, organizing it, taking it in, digesting it, and serving it. That fundamentally centralized view and centralized implementation is, by default, a bottleneck for any sort of scale of operations, so it limits how much organizations can really scale the operational use and intake of the data once you have a centralized team and a centralized architecture in place. The other characteristic that I think leads to those problems, beyond centralization being the first and foremost thing, is that it conflates different capabilities when you create one piece of central architecture.
And especially in the big data space, I feel there are two separate concerns that get conflated as one thing and have to be owned by one team. One is the infrastructure: the underlying data infrastructure that serves hosting, transforming, and access to big data at scale. The other concern is the domain: what is the domain data, in raw format or in an aggregated model format, what are these domains that we actually try to put together in one place? Those concepts, the domains, and the separation of infrastructure from the data itself, get conflated in one place, and that becomes another point of contention and lack of scale. That leads to other challenges around siloed people and siloed skill sets, which have their own impact and lead to the unfulfilled promise of big data, by siloing people based on technical skills, a group of data engineers or ML engineers, just because they know a certain tool set around managing data, away from the domains where the data actually gets produced and where the meaning of that data is known, and separating them from the domains that are actually consuming that data and are more intimately familiar with how they want to use it. That's another point of pressure for the system, which also has a human impact, like the lack of satisfaction and the pressure that these teams are under. They don't really understand the domains, and they get subjected to providing support, ingesting this data, and making sense of it, and the interface between the operational systems and the data lake or big data platform is fragile, because those operational systems and the lifecycle of their data change based on the needs of the operational domains, not the needs of the data team consuming it. That becomes a fragile point where the data team is continuously playing catch-up with the changes happening to the data, alongside the frustration of the consumers, because the data scientists, ML engineers, or business analysts who want to use the data are fully dependent on the siloed data platform engineers to provide the data they need, in the way they need it. So you have a set of frustrated consumers on one side and poor data engineers in the middle trying to work under this pressure. So I think there's a human aspect to that as well.
Tobias Macey
0:17:18
Yeah, it's definitely interesting seeing the parallels between the monolithic approach of the data lake and some of the software architectural patterns that have been evolving, with people trying to move away from the big ball of mud because it's harder to integrate and maintain. You have a similar paradigm in the data lake, where it's hard to integrate all the different data sources in the one system, but also between other systems that might want to use it downstream of the data lake, because it's hard to separate out the different models, or version the data sets separately, or treat them separately, since they're all located in this one area.
Zhamak Dehghani
0:17:56
I agree. I think
Unknown
0:17:58
the reason, and I have a hypothesis as to why this is happening, is that we are not taking the learnings from the monolithic operational systems, their lack of scale, and the human impact of that, and bringing those learnings to the data space. I definitely see, as you mentioned, parallels between the two, and that's where I came from, a bit of left field to this. To be honest, when I wrote this article and first started talking about it, I had no idea whether I was going to offend a large community and have sharp objects thrown at me, or whether it was going to resonate, and luckily it's been the latter, with more positive feedback. So there definitely are parallels between the two. My hypothesis as to why data engineering, or the big data platform, has been stuck at least six or seven years behind what we've learned in distributed systems architecture and more complex system design is that the evolutionary thinking and the evolutionary approach to improving the data platform is still coming from the same community, and the same frame of thinking, that the data warehouse was built upon. Even though we have made improvements, right, we have changed ETL, extract, transform, load, into ELT, extract, load, transform, essentially with the data lake, we are still fundamentally using the same constructs. If you zoom into what the data warehouse stages were in that generation of data platform, and you look at even the Hadoop-based or data-lake-based models of today, they still have similar fundamental constructs, such as pipelines of ingestion, cleansing, transformation, and serving as the major first-level architectural components of the big data platform. We have seen layered, functionally divided architectural components before. Back in the day, ten years ago, when we tried to decouple monoliths, the very first approach that enterprises took in operational systems, when they were thinking about how on earth to break down this monolith into some sort of architectural quantum that they could independently evolve, was: well, I'm going to layer this thing, put a UI on top, a business process layer in the middle, and the data underneath, and I'm going to bring all my DBAs together to own and manage this centralized database that centrally manages data from different domains, and I'm going to structure my organization in layers. And that really didn't work, because the change doesn't happen in one layer. How often do you replace your database? The change happens orthogonally to these layers, across them, right? When you build a feature, you probably need to change your UI and your business model and your data at the same time. And that led to this thinking that, you know what, in a modern digital world the business is moving fast and the change is localized to particular business operations, so let's bring in domain driven thinking and build these microservices around the domain where the change is localized, so we can make that change independently. I feel like we are following the same footsteps. We've come from the data warehouse and a big data place, with one team to rule them all, and then we're scratching our heads to say, okay, how am I going to turn this into architectural pieces
so I can somehow scale it? Well, flip the layered model 90 degrees, tilt your head, and what you see is a data pipeline: I'm going to create a bunch of services to ingest, a bunch of services to transform, a bunch of services to serve, and have these data marts and so on. But how does that scale? It doesn't really scale, because you still have these points of handshake and points of friction between the layers whenever you want to meaningfully create new data sets, create new access points, create new features in your data, or introduce new data products, in a way. So hopefully we can create some cross-pollination between the thinking that is happening in operations and the world of data, and that's what I'm hoping to do with the data mesh: to take those paradigms that we applied in operational domains and create this new model, which sits at the intersection of those disciplines and the world of data, so we can scale it up.
Tobias Macey
0:22:24
And another thing that's worth pointing out from what you were saying earlier is the fact that this centralized data lake architecture is also oriented around a centralized team of data engineers and data professionals. Part of that is because, within the past few years, it's been difficult to find people who have the requisite skill set, partly because we've been introducing a lot of new tool chains and terminologies, but also because we're trying to rebrand the role definitions, and so we're not necessarily as eager to hire people who do have some of the skill sets but have worked under a different position title, whether it's a DBA or a systems administrator, and convert them into these new role types. And I think that's another part of what you're advocating for with this data mesh: the realignment of how the team is organized in a more distributed and domain oriented fashion, embedded with the people who are the initial progenitors of the data and have the business knowledge that's necessary for structuring it in a way that makes it useful to the organization as a whole. So I guess, can you talk through your ideas on how you would organizationally structure the data mesh approach, both in terms of the skill sets and the team members, and also from a technical perspective, and how that contributes to a better approach for evolutionary design of the data systems themselves and the data architecture?
Unknown
0:23:59
Sure. I think the point that you made around skill sets, and not having really created career paths for software generalists to gain the knowledge of all the tooling required to perform operations on data and provide data as a first class asset, or having really siloed people into a particular role like data engineer, who have chosen that path perhaps from a DBA or data warehousing background, without having had the opportunity to really perform as a software engineer
Zhamak Dehghani
0:24:39
and really act
Unknown
0:24:40
put the hat of a software engineer on, has resulted in a silo of skill sets, but also a lack of maturity. I see a lot of data engineers who are wonderful engineers, they know the tools they are using, but they haven't really adopted the best practices of software engineering in terms of versioning and continuous delivery of the data; these concepts are not well established or well understood, because they've basically been evolving in the operational domain, not in the data domain. And I see many wonderful software engineers who haven't had the opportunity to learn Spark and Kafka and stream processing and so on, so that they can add those to their tool belt. I think for a future generation of software engineers, hopefully not so far away, in the next few years, any generalist, full-stack software engineer will have the toolkits they need to know, and have expertise in, to manipulate data on their tool belt. I'm really hopeful for that. There is an interesting statistic from LinkedIn, if I remember correctly from 2016, and I doubt it has changed much since then: they had 65,000 people who had declared themselves as data engineers on their site, and there were 60,000 jobs looking for data engineers in the Bay Area alone. So that shows the discrepancy between what the industry needs and how we are enabling our people, our engineers, to fulfill those roles. That was, sorry, my little rant about career paths and developing data engineers. So I personally, with my teams,
Zhamak Dehghani
0:26:30
welcome the data enthusiasts
Unknown
0:26:33
who want to become data engineers, and provide career paths and opportunities for them to evolve into that role, and I hope other companies would do the same. So going back to your question around what the data mesh is, and what those fundamental organizational and technical constructs are that we need to put in place: I'm going to talk about the fundamental principles, and then hopefully I can bring it together in one cohesive sentence to describe it. The first fundamental principle behind the data mesh is that data is owned and organized through the domains. For example, if you are in, let's say, a health insurance domain, the claims and all the operational systems, and you probably have many of them, that generate raw data about the claims that the members have put through, that raw data should become a first class concern in your architecture. So the domain data, the data constructed or generated around domain concepts such as claims, members, or clinical visits, these are the first class concerns in your structure and in your architecture, which I call domain data products, in a way, and that comes from domain driven, distributed architecture. So what does that mean? It means that at the point of origin, the systems that generate the data are representing the facts of the business as we are operating it, such as events around claims, or perhaps even historical snapshots of the claims, or some current state of those claims. As they provide those, the teams that are most intimately familiar with that data are responsible for providing it in a self-serve, consumable way to the rest of the organization. So one of the principles is that the ownership of data is now distributed, and it's given to the people who are best suited to know and own that data. That ownership can happen in multiple places, right? You might have your source operational systems that would now own a new data set or streams of data, in whatever format is most suitable to represent that domain. And I have to clarify that the raw data those systems of origin generate is not their internal operational database, because internal operational databases and data sets are designed for a different purpose and intent, which is "make my system work really well." They're not designed for other people to get a view of what's happening in the business in that domain and to capture those facts and realities. So it's a different data set, whether it's a stream, very likely, or a time series, or whatever format it is; these are native data sets owned by the systems of origin and the people who operate those systems. And then you have data sets that are maybe more aggregate views. For example, again in health insurance, you might want to have predictive points of intervention, or critical points of contact, where you want the care provider to contact the member to make sure they are getting the health care they need at the right point in time, so that they don't get sick and make a lot of claims on the insurance at the end of the day.
So the domain that is responsible for making those contacts and making those smart decisions and predictions about when and where to contact a member might produce an aggregate view of the member, which is the historical record of all the visits the member has made and all the claims, a joined aggregate view of the data. But that data might be useful not only for their domain but for other domains, so it becomes another domain driven data set that the team provides for the rest of the organization to use. So that's the distributed domain aspect of it. The second principle is that for data to really be treated as an asset, for it to be consumed in a distributed fashion by multiple people and still be joined together, filtered, aggregated in a meaningful way, and used in a self-serve way, I think we need to bring product thinking to the world of data. That's why I call these things domain driven data products. Product thinking in a technology space, what does that mean? It means I need to put all the technology characteristics and tooling in place so that
0:31:29
I can provide a delightful experience for people who want to access the data. You might have different types of consumers: they might be data scientists, or maybe they just want to analyze the data and run some queries to see what's there, or they may want to convert that data into some other, easy to understand form, a spreadsheet, so you have this diverse set of consumers for your data sets. For a data set to be considered a product, a data product, and to bring product thinking to it, you need to think about, okay, how am I going to provide discoverability? That's the first step: how can somebody find my data product? Then addressability, so they can programmatically use it. I need to make sure that I put enough security around it so that whoever is authorized to use it can use it, and they can see the things they should see and not the things they shouldn't. And for this to be self-serve as a product, I need really good documentation, maybe example data sets that consumers can just play with to see what's in the data, and I need to provide the schema so they can see structurally what it looks like. So there is all the tooling around self-describing data and supporting the understanding of what the data is; there is a set of practices that you need to apply to data for it to become a self-serve asset. And finally, the third discipline at this intersection that would help the data mesh is the platform thinking part of it. At this point of the conversation, people usually tell me: hold on a minute, you've asked all of my operational teams to be independent teams that own their data and also serve it in such a self-serve, easy to use way; that's a lot of repeatable metal work these teams have to do to actually get to the point where they can provide the data, like all the pipelines they need to build internally to extract data from their databases, or new event sourcing to be put in place that leads to emitting the events, plus all the work that needs to go into documentation and discoverability. How can they do this? This is a lot of cost, right? So that's where the data infrastructure, or platform thinking, comes into play. I see a very specific role in a data mesh, in this mesh of domain oriented data sets, for infrastructure, for what I call self-serve data infrastructure: all the tooling that we can put in place on top of our raw data infrastructure, the raw data infrastructure being your storage and your
Zhamak Dehghani
0:34:21
backbone messaging or
Unknown
0:34:23
backbone event logs, and so on. But on top of it, you need a set of tooling that these data product teams can use easily, so that they can build those data products quickly. So when we build a data mesh, we put a lot of capabilities into that layer, the self-serve infrastructure layer, for supporting discoverability, documentation, secure access, and so on. The success criterion for that data infrastructure tooling is decreased lead time for a data product team or an operational team to get their data set onto this kind of mesh platform. And that's, I think, a big part of
I think at this point it's easy to imagine that, okay, if I have domain-oriented data products, clearly I need to have people who can bring product thinking to those domains. So you have people who play the role of data product owner — they might be the same person as the tech lead or the product owner of the operational system — but what they care about is the data as a product they're providing to the rest of the organization: the lifecycle of that data, its versioning, what features and elements they want to put into it, what the service level objectives are, and what quality metrics they support. For example, if the data is near real time it probably has some missing or duplicate events in there, but maybe that's an accepted service level agreement in terms of data quality. And they think about all the stakeholders for that data. Similarly, the software engineers who are building operational systems now also have skill sets around using Spark or Airflow or whatever tooling they need to implement their data products. And similarly, the infrastructure folks who are often dealing with your compute, your storage, and your build pipelines, and providing those as self-serve tooling, are now also thinking about big data storage and data discovery. That's an evolution we saw when the API revolution happened: a whole lot of technology came into infrastructure, like API gateways and API documentation, and became part of the service infrastructure. And I hope to see the same kind of change happen with infrastructure folks supporting a distributed data mesh.
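As a rough sketch of the data product properties described above — discoverability, addressability, schema, documentation, ownership, and service level objectives — the snippet below models a self-describing data product descriptor and registers it with a toy in-memory catalog. All field names, values, and the catalog itself are illustrative assumptions, not the API of any particular data mesh platform mentioned in the conversation.

```python
# A minimal, hypothetical "data product" descriptor capturing the properties discussed:
# discoverability, addressability, schema, documentation, ownership, and SLOs.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    name: str                    # globally discoverable name, e.g. "claims.member-claims"
    owner: str                   # the domain team accountable for this product
    address: str                 # how consumers programmatically reach it (topic, table, API)
    schema: Dict[str, str]       # field name -> type, so consumers can see the structure
    description: str             # human-readable documentation
    slo: Dict[str, str] = field(default_factory=dict)      # e.g. freshness, duplicate policy
    sample_rows: List[dict] = field(default_factory=list)  # example data to play with


# A toy, in-memory "catalog" standing in for the self-serve platform's discovery service.
CATALOG: Dict[str, DataProduct] = {}


def register(product: DataProduct) -> None:
    """Publish a data product so other domains can discover and address it."""
    CATALOG[product.name] = product


register(DataProduct(
    name="claims.member-claims",
    owner="claims-domain-team",
    address="kafka://broker:9092/claims.member-claims.v1",
    schema={"member_id": "string", "claim_id": "string", "amount": "decimal", "ts": "timestamp"},
    description="Near-real-time stream of adjudicated member claims.",
    slo={"freshness": "< 5 minutes", "duplicates": "allowed, at-least-once delivery"},
    sample_rows=[{"member_id": "m-123", "claim_id": "c-9", "amount": 120.5,
                  "ts": "2019-07-01T10:00:00Z"}],
))

print(sorted(CATALOG))  # a consumer can now "find" the product by browsing the catalog
```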
Tobias Macey
0:37:07
Yeah, I think that when I was reading through your posts, one of the things that came to mind is a parallel with what's been going on in technical operations, where you have an infrastructure operations team that's responsible for ensuring there is baseline capacity for providing access to compute, storage, and networking. Then, layered on top of that in some organizations, you have the idea of the site reliability engineer who acts as a consultant to all the different product teams to make sure they're aware of and thinking about all of the operational aspects of the software they're trying to deliver. Drawing a parallel with the data mesh, you have this data platform team that's providing access to the underlying resources for storing, processing, and distributing the data, and then a set of people who are either embedded within the different product teams or acting as consultants to them, providing education and information about how to leverage all of these services, so that the data they're generating within their systems is usable downstream by other teams and other organizations, either for direct integration or for building secondary data products in a reusable and repeatable manner.
Zhamak Dehghani
0:38:26
Yes, definitely, the parallels are there, and I think we can learn from those lessons. We have gone through the Dev and Ops separation, and with DevOps we brought them together, and as a side effect we created a wonderful generation of engineers called SREs. So absolutely, the same parallels and those learnings can be applied to the world of data. One thing I would say, though: going into this distributed model — with distributed ownership around data domains, with infrastructure folks supporting that, and with product thinking and platform thinking brought together — it's a hard journey. And it's counter-cultural, to a large degree, to what we have in organizations today. Today the data platform, the data lake, is a separate organization all the way off to the side, really not embedded into the operational systems. What I hope to see is that data becomes the fabric of the organization. Today we do not argue about having APIs serve capabilities to the rest of the organization from services that are owned by different people, but for some reason we argue that that's not a good model for sharing data, which I'm still puzzled by. So data right now is separated; people still think there's a separate team to deal with their data, and the operational folks are not really incentivized to own or even provide trustworthy data. I think we still have a long way to go for organizations to reshape themselves, and there is a lot of friction and challenge between the operational systems and the data that needs to be unlocked, and the consumption of that data in a meaningful way. I am working with, and have also seen, pockets of this being implemented by people who are getting started, and where I'm starting with a lot of our clients, day one does not look like distributed ownership and distributed teams. In fact, we start with maybe one physical team that brings people from those operational domains in as SMEs and brings infrastructure folks into the team. But though it's one physical team, the underlying tech we use — the repos for the code that generates data for the different data domains — are separate repos with separate pipelines, and the infrastructure has a separate backlog for itself. So virtually, internally, we have separate teams, but on day one they're still working under one program of work. Hopefully, as we go through the maturity of building this in a distributed fashion, those internal virtual teams — which have been working on their separate repos and have a very explicit domain they're looking after — can, once we have enough of the infrastructure in place to support those domains becoming completely autonomous, be turned into long-standing teams that are responsible for the data of that domain. That's the target state, but I do want to mention that there is a maturity curve and a journey we have to go through, and we'll probably make some mistakes along the way, and learn and correct, to get to that point at scale. But I'm very hopeful, because I have seen this happen in the parallel you mentioned in the operational world. I think we have a few missing pieces for it to happen in the data world, but I'm hopeful based on the conversations that I've had with companies since this started. I've heard a lot of horror stories and failure stories and the need for change, and I've also been talking to many companies that have come forward and said, "We're actually doing this — let's talk." So they may not have publicly published or talked about their approach, but internally they are at the leading front of changing the way things are done and challenging the status quo around data management.
Tobias Macey
0:42:39
Yeah, one of the issues with data lakes is that, at face value, the benefit they provide is that all of the data is co-located, so you reduce latencies when you're trying to join across multiple data sets. But then you potentially lose some of the domain knowledge of the source teams or source systems that were generating the information in the first place. And so now we're moving in the other direction, trying to bring a lot of the domain expertise to the data and providing that in a product form. But as part of that we create other issues in terms of discoverability of all the different data sets, and consistency across different schemas for being able to join across them, where, if you leave everybody to their own devices, they might end up with different schema formats, and then you're back in the territory of trying to build master data management systems and spending a lot of energy and time trying to coerce all of these systems into a common view of the information that's core to your overall business domain. When I was reading through the post, initially I was starting to think that we're trading off one set of problems for another. But I think that by having the underlying stratum of the data platform team, with the actual data itself being managed somewhat centrally from the platform perspective but separately from the domain expertise perspective, we're starting to reach a hybrid where we're optimizing for all the right things and not necessarily taking on too many trade-offs. So I'm curious what your perspective is in terms of areas like data discovery and schema consistency, and where the responsibilities lie between the different teams and organizationally, to ensure that you don't end up optimizing too far in one direction or the other?
Zhamak Dehghani
0:44:35
Absolutely, I think you made a very good point and a very good observation. We definitely don't want to swing back into data silos and inconsistency; the fundamental principle behind any distributed system working is interoperability. You mentioned a few things. One is data locality for performance — data gravity — where the computation, like your pipelines accessing that data, runs close to your data and the data is co-located. Those underlying physical implementations — we should bring those learnings from data lake design into the data mesh; they should absolutely be there. A distributed ownership model does not mean a distributed physical location of the data, because, as you mentioned, the data infrastructure team's job is to provide location-agnostic, fit-for-purpose storage as a service to those data product teams. It's not every team willy-nilly deciding where or how to store its data; there is an infrastructure that is set up to incorporate all of those performance concerns, scalability concerns, and co-location concerns of data and computation, and to answer them in a somewhat agnostic way for the data product teams. So whether the data physically ends up in a bunch of S3 buckets or on an instance of Kafka or wherever it is, as a data product team I shouldn't care about that; I should go to my data infrastructure to get that service from them, and they are best positioned to tell me where my data should live based on my needs, the scale of my data, and the nature of it. So we can still bring all of those best practices around the location of the data into the infrastructure.

Discoverability is another aspect of it. I think one of the most important properties of a data mesh is having some sort of global, single point of discovery, so that as someone who wants to build a new solution — say I want to get some insight about my patients, in my health insurance example —
0:47:01
I need to be able to go to one place, my data catalog, or for lack of a better word my data discovery system, where I can see all of the data products that are available for me to use: who owns them, the meta information about the data, what its quality is, when it was last updated and with what cadence, and where I can get sample data to play with. That discoverability function I see as a centralized, global function with which, in a federated way, every data product can register itself. We have primitive implementations of that, which are just Confluence pages, up to more advanced implementations, but it is a globally accessible and centralized solution.

One other thing you mentioned: well, now you leave the decision about how the data should be provided and what the schema should be to those teams. That hasn't changed from the lake-style implementation, which, at least in its pure initial definition, also leaves that to the data sources. But as I mentioned, interoperability and standardization are key elements; otherwise we end up with a bunch of domain data that there is no way to join together. For example, take an entity like a member: the member in a health insurance claims system probably has its own internal IDs, and the healthcare provider system has its own internal ID, but the member is an entity that crosses different domains. If we don't have standardization around mapping those internal member IDs to some sort of global member ID that lets me join the data across different domains, these disparate domain data sets are just good by themselves and don't allow me to do merging and filtering and searching. So that global governance and standardization is absolutely key, and it sets standards around anything that enables interoperability: from the formats of certain fields, to how we represent time and time zones, to policies for federated identity management, to what the uniform way of securely accessing data is, so we don't have five different security implementations or access models. All of those concerns are part of that global governance, and their implementation, as you mentioned, goes into your data infrastructure as shared tooling that people can easily use. The API revolution wouldn't have happened if we didn't have HTTP as a baseline standardization and REST as a standardization; those standards allowed us to decentralize our monolithic systems, and we had API gateways and API documentation as a centralized place to find which APIs to use. So those same concerns and the best practices out of the lake should come into the data mesh and not get lost.
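As a toy sketch of the identifier mapping described here: each domain keeps its own internal member IDs, and a globally governed mapping to a shared member ID is what makes the domain data sets joinable. The IDs, field names, and domains below are hypothetical examples rather than anything taken from a real system.

```python
# Two hypothetical domains keep their own internal member identifiers.

# Claims domain: keyed by its internal member ID.
claims = [
    {"claims_member_id": "CLM-001", "claim_total": 250.0},
    {"claims_member_id": "CLM-002", "claim_total": 90.0},
]

# Care-provider domain: keyed by a different internal member ID.
visits = [
    {"provider_member_id": "PRV-77", "visits": 3},
    {"provider_member_id": "PRV-42", "visits": 1},
]

# Globally governed mapping of internal IDs to one global member ID.
global_member_id = {
    "CLM-001": "member-1001", "PRV-77": "member-1001",
    "CLM-002": "member-1002", "PRV-42": "member-1002",
}

# Re-key both domains on the global ID, then join across domains.
claims_by_member = {global_member_id[r["claims_member_id"]]: r for r in claims}
visits_by_member = {global_member_id[r["provider_member_id"]]: r for r in visits}

for member in sorted(claims_by_member.keys() & visits_by_member.keys()):
    joined = {"member": member, **claims_by_member[member], **visits_by_member[member]}
    print(joined)
```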
Tobias Macey
0:50:14
And from an organizational perspective, particularly for any groups that already have an existing data lake implementation, talking through this I'm sure it sounds very appealing, but it can also sound fairly frightening because of the amount of work that's necessary. As you mentioned, the necessary step is to take a slow and iterative approach to going from point A to point B. But there are also scales of organization where they might not have the capacity to even begin thinking about having separate domains for their data, because everybody is just one group. So I'm curious what your perspective is on how this plays out for smaller organizations, both in terms of the organizational patterns and in terms of the amount of overhead that is necessary for this approach, and whether there is a certain tipping point where it makes more sense to go from a data lake to a data mesh, or vice versa?
Zhamak Dehghani
0:51:08
That's a very good question. On the smaller scale of organizations: if they have a data lake, if they have a centralized team that is ingesting data from many different systems, and this team is separate from the operational teams but, because of the size, they can manage, and they have closed the loop — and by closed loop I mean it's not just that they consume data and put it in one place, but they have also satisfied the requirements of the use cases consuming the data and become a data-driven company — if you're there, and you have managed to build a tight loop between the operational systems providing the data and the intelligence or ML services sitting on top of it consuming it, and you have a fast cycle of iteration to update the data and also to update your models, and that's working well for you, there's no need to change. A lot of startups and scale-ups are still using the monolithic Rails application they built on day one, and that's still okay. But if you have pain — if the pain you're feeling is around ingesting the data and keeping it up to date and understanding the data you have, if you see fragility, the fragile points of connectivity where the source changes and your pipelines fall apart, or you're struggling to respond to the innovation agenda of your organization (we talk about build, test, and learn, which requires a lot of experimentation, and your data team is struggling to change the data to support those experiments) — then I think that's an inflection point to start thinking about how to decouple the responsibilities. A very simple starting point would be to decouple your infrastructure from the data on top of it. Your data engineers today are probably looking after both the infrastructure pieces, which are cross-domain — they really don't care which data domain they're serving or transforming — and the data itself. Separate the infrastructure into its own team, and design success criteria for it that are aimed at enabling the people who want to provide data on top of that infrastructure. Separate that layer first. Then, when you come to the data itself, find the domains that are going through change most often, the ones that are changing continuously. Often those are not your customer-facing applications or the operational systems at the point of origin — those may change, but the data they emit doesn't change that much, though sometimes it does. Often it's the ones sitting in the middle, the aggregates. Either way, find which data domain is changing most rapidly and separate that one first. Even if your data team is still the same physical team, within that team some people become the owners or custodians of that particular domain, and you bring those people together with the actual systems emitting the data — the system of origin, or the teams aggregating the data in the middle if it's an aggregate — and start experimenting with building that data set as a separate data product, and see how that works.

That would be where I would start in a smaller organization. For organizations that already have many, many data warehouses, multiple iterations of them, somewhat disconnected, and a lake that maybe isn't really working for them — again, if you see the failure symptoms I mentioned at the beginning of this conversation: you can't scale, you can't bootstrap new needs, you haven't managed to become data-driven — then start by finding use cases. I always take a value-driven, use-case-driven approach: find a few use cases that would really benefit from data that comes directly from the source, data that is timely and rich and changing often — whether they are ML use cases or BI use cases — group them together, identify the real sources of the data, not intermediary data warehouses, and start building out this data mesh one iteration, one use case, a few data sources at a time. That's exactly what I'm doing right now, because the organizations I work with mostly fall into that category — that kind of hairy data space that needs to change incrementally. And there is still a place for the lake, and still a place for the data warehouse — a lot of BI use cases do require a well-defined multidimensional schema — but they don't need to be the centerpiece of this architecture. They become a node on your mesh, mostly closer to the consumption side, because they satisfy a very specific set of use cases around business intelligence. They become consumers of your mesh, not the central piece providing the data.
Tobias Macey
0:56:32
And in your experience of working with different organizations and through different problem domains and business domains, I'm wondering if there are any other architectural patterns or anti-patterns or design principles that you think data professionals should be taking a look at that aren't necessarily widespread within the community?
Zhamak Dehghani
0:56:51
Yes, I think there is space to bring a lot of the practices that we take somewhat for granted today when building a modern digital platform or a modern web application into both data and ML. Continuous delivery is an example: all the code, and also the data itself, and also the models that you create, are under the control of your continuous delivery system; they are versioned, they have integrity tests and tests around them, and deploying to production means releasing it so that it's accessible to the rest of the organization. So I would say basic good engineering hygiene around continuous delivery — those are not really well understood or well established concepts in either data or ML. A lot of ML engineers or data scientists don't even know what continuous delivery for ML looks like, because between making a model in R on your laptop and putting something into production there are months of work and translation, with somebody else rewriting the code for you, and what ends up in production is using the stale data it was trained on. There is no concept of "oh, my data changed, now I need to kick off another build cycle because I need to retrain my machine learning model." So continuous delivery in the ML world, and continuous delivery in the data world as well — data integrity tests, testing for data — is not a well-understood practice; versioning data is not a well-understood practice. Think schema versioning or backwards compatibility. Those would be some of the early technical concerns and capabilities I would introduce into data engineering teams or ML engineering teams.
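As a minimal sketch of the kind of data integrity and schema-compatibility checks such a continuous delivery pipeline could run before releasing a new batch or a retrained model, the snippet below uses plain pytest-style assertion functions; the schema, thresholds, and sample records are illustrative assumptions rather than any particular framework's API.

```python
# Hypothetical quality gates a data product's delivery pipeline might run on each new batch.

EXPECTED_SCHEMA = {"member_id": str, "claim_id": str, "amount": float}

new_batch = [
    {"member_id": "m-1", "claim_id": "c-1", "amount": 120.5},
    {"member_id": "m-2", "claim_id": "c-2", "amount": 80.0},
]


def test_schema_backwards_compatible():
    # Every field in the published schema must still be present with the expected type.
    for row in new_batch:
        for name, ftype in EXPECTED_SCHEMA.items():
            assert name in row, f"missing field {name}"
            assert isinstance(row[name], ftype), f"{name} is not {ftype.__name__}"


def test_data_integrity():
    # Example gates: keys are non-empty and claim IDs are unique within the batch.
    assert all(row["member_id"] for row in new_batch)
    claim_ids = [row["claim_id"] for row in new_batch]
    assert len(claim_ids) == len(set(claim_ids)), "duplicate claim IDs in batch"


if __name__ == "__main__":
    test_schema_backwards_compatible()
    test_data_integrity()
    print("data checks passed; safe to publish this version of the data product")
```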
Tobias Macey
0:58:47
And are there any other aspects of your thinking around data mesh, the challenges of data lakes, or anything else pertaining to the overall use of data within organizations that we didn't discuss yet that you'd like to cover before we close out the show?
Zhamak Dehghani
0:59:03
I think we pretty much covered everything. I would maybe emphasize a couple of points. Making data a first-class concern, an asset structured around your domains, does not mean that you have to have well-modeled data. It could be your raw data, it could be the raw events generated at the point of origin, but with product thinking applied: self-serve access, some form of understood or measured quality, and good documentation around it, so that other people can use it and you treat it as a product. It doesn't necessarily mean doing a lot of modeling of the data. The other thing I would mention — I guess we have already talked about it — is governance and standardization. I would love to see more standardization, the same way we saw with the web and with APIs, applied to data. We have a lot of different open source tools and a lot of different proprietary tools, but there isn't a standardization that allows me, for example, to run distributed SQL queries across a diverse set of data sets. The cloud providers are in a race to provide all sorts of wonderful data management capabilities on their platforms, and I hope that race leads to some form of standardization that allows distributed systems to work. And I think a lot of the technologies we see today, even around data discovery, are based on the assumption that data is operational data hidden in some database in a corner of the organization, not intended for sharing, and that we need to go find it and extract it. I think that's a flawed predisposition. We need to think about tooling for data sets that are intentionally shared and intentionally diverse. There is a huge opportunity for tool makers out there — a big white space to build the next generation of tools that are not designed to fight bad data hidden somewhere, but designed for data that is intentionally shared and treated as an asset: discoverable, accessible, measurable, and queryable, yet owned in a distributed fashion. So those are the few final points I would emphasize.
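To illustrate the kind of cross-data-set SQL capability being wished for here, federated query engines such as Trino (formerly Presto SQL) can already join data sets that live in different systems from a single statement. The cluster address, catalogs, schemas, and table names below are hypothetical, and this is only a hedged sketch assuming a running cluster configured with those connectors; it is not something prescribed in the conversation.

```python
# Hypothetical federated query joining three domains that live in different stores.
import trino

FEDERATED_QUERY = """
SELECT m.global_member_id, c.claim_total, v.visit_count
FROM postgresql.claims.member_claims AS c        -- claims domain, in an operational store
JOIN hive.care.provider_visits       AS v        -- care-provider domain, in object storage
  ON c.global_member_id = v.global_member_id
JOIN kafka.events.member_profile     AS m        -- member profile, streamed as events
  ON c.global_member_id = m.global_member_id
"""

# Assumes a Trino coordinator at this (made-up) host with the catalogs above configured.
conn = trino.dbapi.connect(host="trino.example.internal", port=8080,
                           user="analyst", catalog="hive", schema="care")
cursor = conn.cursor()
cursor.execute(FEDERATED_QUERY)
for row in cursor.fetchall():
    print(row)
```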
Tobias Macey
1:01:50
All right. And for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. We've touched on it a bit already, but as a final question I'd also like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Zhamak Dehghani
1:02:08
Standardization. If I could wish for one thing, it would be a bit of convergence and standardization that still allows a polyglot world. You can still have a polyglot world, but I want to see something like the convergence that happened around Kubernetes in the infrastructure and operational world, or the standardization that we had with HTTP and REST or gRPC and so on, in the world of data, so that we can support a polyglot ecosystem. I'm looking for tools that are ecosystem players and support a distributed, polyglot data world — not data that can be managed only because we put it in one database or one data store, used, owned, and ruled by one particular tool. So open standardization around data is what I'm looking for. And there are some small movements, though unfortunately they're not coming from the data world. For example, the work of the CNCF, off the back of the serverless working group, treats events as one of the fundamental concepts and talks about CloudEvents as a standard way of describing events. But that's coming from left field — it's coming from the operational world trying to play in an ecosystem, not from the data world — and I hope we can get more of that from the data world.
Tobias Macey
1:03:33
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing and your thoughts on how we can make more scalable and maintainable data systems in our organizations and as an industry. It's definitely a lot to think about, and it mirrors a lot of my thinking in terms of the operational characteristics, so it's great to get your thoughts on the matter. Thank you for that, and I hope you enjoy the rest of your day.
Zhamak Dehghani
1:04:00
Thank you, Tobias. Thank you for having me. I've been a big fan of your work and what you're doing, getting information about this silo of data management out to everyone and really making that information available — to me, at least, since I've become a data enthusiast. Thank you for having me. I'm happy to help.
Tobias Macey
1:04:18
Have a good day.

Data Labeling That You Can Feel Good About - Episode 89

Summary

Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for Machine Learning and more

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what CloudFactory is and the story behind it?
  • What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
  • What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
  • Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
    • What protocols do you have in place to ensure data quality and identify potential sources of bias?
  • What role do humans play in the lifecycle for AI and ML projects?
  • I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
    • How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
  • Can you share some stories of cloud workers who have benefited from their experience working with your company?
  • What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
  • What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
  • What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
  • What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
    • How does that tie into your plans for CloudFactory in the medium to long term?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA