Streaming

Navigating Boundless Data Streams With The Swim Kernel - Episode 98

Summary

The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Swim.ai is and how the project and business got started?
    • Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
  • What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
  • How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
  • Can you describe a typical design for an application or system being built on top of the Swim platform?
    • What does the developer workflow look like?
      • What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
  • Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
  • For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
    • What mechanisms are in place to account for network failures?
  • Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
  • Since there is no explicit data layer, how is data redundancy handled by Swim applications?
  • What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
  • What have you found to be the most challenging aspects of building the Swim platform?
  • What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
  • What do you have planned for the future of the technical and business aspects of Swim.ai?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries, or are you just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Simon Crosby about Swim.ai, the data fabric for the distributed enterprise. So Simon, can you start by introducing yourself?
Simon Crosby
0:02:28
Hi, I'm Simon Crosby. I am the CTO, I guess a CTO of long duration; I've been around for a long time. And it's a privilege to be with the Swim folks, who have been building this fabulous platform for streaming data for about five years.
Tobias Macey
0:02:49
And do you remember how you first got involved in the area of data management?
Simon Crosby
0:02:53
Well, I have a PhD in applied mathematics and probability, so I am kind of not a data management guy, I'm an analysis guy. I like what comes out of, you know, streams of data and what inference you can draw from it. So my background is more on the analytical side. And then along the way, I began to see how to build big infrastructure for it.
Tobias Macey
0:03:22
And now you have taken up the position as CTO for swim.ai, I'm wondering if you can explain a bit about what the platform is and how the overall project and business got started?
Simon Crosby
0:03:33
Sure. So here's the problem. We're all reading all the time about these wonderful things that you can do with machine learning, and streaming data, and so on; it all involves cloud and other magical things. And in general, most organizations just don't know how to make head or tail of that, for a bunch of reasons; it's just too hard to get there. So if you're an organization with assets that are spitting out lots of data, and that could be a bunch of different types, you know, you probably don't have the skill set in house to deal with a vast amount of information. And we're talking about boundless data sources here, things that never stop. And so to deal with the data flow pipelines, to deal with the data itself, to deal with the learning and inferences you might draw from that, and so on, enterprises face a huge skill set challenge. There is also a cost challenge, because today's techniques related to drawing inference from data in general revolve around, you know, large, expensive data lakes, either in house or perhaps in the cloud. And then finally, there's a challenge with the timeliness within which you can draw an insight. Most folks today believe that you store data, and then you think about it in some magical way, and you draw inference from that. And we're all suffering from the Hadoop and Cloudera, I guess, after effects. Really, this notion of storing and then analyzing needs to be dispensed with for fast data; certainly for boundless data sources that will never stop, it's really inappropriate. So when I talk about boundless data today, we're going to talk about data streams that just never stop, and about the need to derive insights from that data on the fly, because if you don't, something will go wrong. So it's of the type that would stop your car before you hit the pedestrian in the crosswalk, that kind of stuff. So for that kind of data, there's just no chance to, you know, store it all down to a hard disk first.
Tobias Macey
0:06:16
and how would you differentiate the work that you're doing with the Swim.ai platform and the SwimOS kernel from things that are being done with tools such as Flink, or other streaming systems such as Kafka, which has now got capabilities for being able to do some limited streaming analysis on the data as it flows through, or also platforms such as Wallaroo that are built for being able to do stateful computations on data streams?
Simon Crosby
0:06:44
So first of all, there have been some major steps forward, and anything we do, we stand on the shoulders of giants. Let's start off with distinguishing between the large enterprise skill set that's out there, and the cloud world. And all the things you mentioned live in the cloud world. So with that distinction as a reference, most people in the enterprise, when you said Flink, wouldn't know what the hell you were talking about. Okay, similarly Wallaroo or anything else; they just wouldn't know what you're talking about. And so there is a major problem with the tools and technologies that are built for the cloud, really for cloud native applications, and the majority of enterprises, who are just stuck with legacy IT and application skill sets and are still coming up to speed with the right thing to do. And to be honest, they're still getting over the headache of Hadoop. So then, if we talk about the cloud native world, there is a fascinating distinction between all the various projects which have started to tackle streaming data. Some major progress has been made there; I'd be delighted to point some out, Swim being one of them, and to go into each one of those projects in detail as we go forward. The key point being that, first and foremost, the large majority of enterprises just don't
Tobias Macey
0:08:22
know what to do. And then within your specific offerings, there is the Data Fabric platform, which you're targeting at enterprise consumers, and then there's also the open source kernel of that in the form of SwimOS. I'm wondering if you can provide some explanation as to what the differentiating factors are between those two products, and the sort of decision points for when somebody might want to use one versus the other?
Simon Crosby
0:08:50
Yeah, let's look first at the distinction between the application layer and the infrastructure needed to run a large distributed data flow pipeline. In Swim, all of the application layer stuff, everything you need to build an app, is entirely open source. Some of the capabilities that you want to run a large distributed data pipeline are proprietary. And that's really just because, you know, we're building a business around this; we plan to open source more and more features over time.
Tobias Macey
0:09:29
And then as far as the primary use cases that you are enabling with the Swim platform, and some of the different ways that enterprise organizations are implementing it, what are some of the cases where using something other than Swim, either the OS or the Data Fabric layer, would be either impractical or intractable if they were trying to use more traditional approaches such as Hadoop, as you mentioned, or a data warehouse and more batch oriented workflows?
Simon Crosby
0:09:58
So let's start off describing what Swim does, if I can; that might help. In our view, it's our job to build the pipeline, and indeed the model, from the data. Okay, so Swim just wants data, and from the data we will automatically build this typical data flow pipeline. And indeed, from that, we will build a model of arbitrarily interesting complexity, which allows us to solve some very interesting problems. Okay. So the Swim perspective starts with data, because that's where our customers' journey starts. They have lots and lots of data, and they don't know what to do with it. And so the approach we take in Swim is to allow the data to build the model. Now, you would naturally say that's impossible in general, but what it requires is some ontology at the edge which describes the data. You could think of it as a schema, in fact; basically, it describes what data items mean, in some sort of useful sense, to us as modelers. But then, given data, Swim will build that model. So let me give you an example. Given a relatively simple ontology for traffic and traffic equipment, so the positions of lights, the loops in the road, the lights and so on, Swim will build a model which is a stateful digital twin for every sensor, for every source of data, which runs concurrently in some distributed fabric, processes its own raw data, and statefully evolves, okay. So simply given that ontology, Swim knows how to build stateful, concurrent little things we call web agents; actually, I'm using that term,
0:12:18
I guess the same as digital twin.
0:12:21
And these are concurrent things which are going to statefully process raw data and represent it in a meaningful way. And the cool thing about that is that each one of these little digital twins exists in a context, a real world context, that Swim is going to discover for us. So for example, an intersection might have 60 to 80 sensors, so there's this notion of containment; but also, intersections are adjacent to other intersections in the real world map, and so on. That notion of adjacency is also a real world relationship. And in Swim, this notion of a link allows us to express the real world relationships between these little digital twins. And linking in Swim has this wonderful additional property: essentially, in Swim, there is never a pub, but there is a sub. And if something links to something else, so say I link to you, then it's like LinkedIn for things: I get to see the real time updates of the in-memory state held by that digital twin. So digital twins link to digital twins courtesy of real world relationships, such as containment or proximity. We can even do other relationships, like correlation. So these twins are
0:14:05
also linked to each other, which allows them to share data.
0:14:09
And sharing data allows interesting computational properties to be derived. For example, we can learn and predict. Okay, so job one is to define the ontology; Swim then goes and builds a graph, a graph of digital twins, which is constructed entirely from the data. And the linking happens as part of that. And that allows us to then construct interesting computations.
0:14:45
Is that useful?
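To make the idea of a stateful web agent and links a little more concrete, here is a minimal Java sketch of an intersection twin that statefully consumes raw sensor readings and streams its derived state to anything linked to it. This is purely an illustrative assumption of the shape of such an agent, not the actual SwimOS API; all class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a stateful "digital twin" for one intersection.
// Names and structure are illustrative; this is not the SwimOS API.
public class IntersectionTwin {

    // Raw reading from one sensor at this intersection (assumed shape).
    public record SensorReading(String sensorId, long timestampMillis, double value) {}

    private final String intersectionId;
    private double latestOccupancy;          // in-memory state, evolved on every message
    private final List<Consumer<Double>> links = new ArrayList<>(); // linked twins or operators

    public IntersectionTwin(String intersectionId) {
        this.intersectionId = intersectionId;
    }

    // Called for every raw message; state evolves in memory, nothing is stored to disk here.
    public void onSensorReading(SensorReading reading) {
        // Trivial stateful reduction: exponential moving average of loop-detector occupancy.
        latestOccupancy = 0.9 * latestOccupancy + 0.1 * reading.value();
        // Stream the derived state to everything that has linked to this twin.
        for (Consumer<Double> link : links) {
            link.accept(latestOccupancy);
        }
    }

    // "Linking": an adjacent twin or an operator subscribes to this twin's state updates.
    public void link(Consumer<Double> downstream) {
        links.add(downstream);
    }

    public static void main(String[] args) {
        IntersectionTwin twin = new IntersectionTwin("intersection-42");
        // A linked neighbour sees every in-memory state change as it happens.
        twin.link(occupancy ->
            System.out.println("neighbour sees occupancy of intersection-42: " + occupancy));
        twin.onSensorReading(new SensorReading("loop-1", System.currentTimeMillis(), 0.8));
        twin.onSensorReading(new SensorReading("loop-2", System.currentTimeMillis(), 0.6));
    }
}
```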
Tobias Macey
0:14:46
Yes, that's definitely helpful to get an idea of some of the use cases and some of the ways that the different concepts within Swim work together, to be able to build out what a sort of conceptual architecture would be for an application that would utilize Swim.
Simon Crosby
0:15:03
So the key thing here is what I mean by an application. The application I just described is to predict the future, the future traffic in a city, or what's going to happen in the traffic arena right now. Now, I could do that for a bunch of different cities. What I can tell you is I need a model for each city. And there are two ways to build a model. One way is I get a data scientist, and have them build it, or maybe they train it, and a whole bunch of other things, and I'm going to have to do this for every single city where I want to use this application. The other way to do it is to build the model from the data. And that's the Swim approach. So what Swim does is, simply given the ontology, build these little digital twins, which are representatives of the real world things, get them to statefully evolve, and then link them to other things, you know, to represent real world relationships. And then suddenly, hey presto, you have built a large graph, which is effectively the model that you would otherwise have had to have a human build, right? So it's constructed automatically, in the sense that in any new city you go to, this thing is just going to unfold, and just given a stream of data, it will build a model which represents the things that are the sources of data and their physical relationships. Does that make sense?
Tobias Macey
0:16:38
Yeah, and I'm wondering if you can expand upon that in terms of the type of workflow that a developer who is building an application on top of Swim would go through, as far as identifying what those ontologies are and defining how the links will occur as the data streams into the different nodes in the Swim graph.
Simon Crosby
0:17:01
So the key point here is that we think we can build, like, 80% of an app from the data, okay. That is, we can find all of the big structural properties of relevance in the data, and then let the application builder drop in what they want to compute. And so let me try and express this slightly differently. Job one, we believe, is to build a model of stateful digital twins which almost mirror their real world counterparts. So at all points in time, their job is to represent the real world, as faithfully and as close to real time as they can, in a stateful way which is relevant to the problem at hand. Okay, so 'I'm a red light', okay, something like that. And the first problem is to build these stateful digital twins, which are interlinked, which represent the real world things, okay. And it's important to separate that from the application layer component of what you want to compute from that. So frequently, we see people making the wrong decision, that is, hard coupling the notion of prediction, or learning, or any other form of analysis into the application in such a way that any change requires programming. And we think that that's wrong. So job one is to have this faithful representation of the real time world, in which everything evolves its own state whenever its real world twin evolves, and evolves statefully. And then the second component, which we do on a separate timescale, is to inject operators which are going to then compute on the states of those things at the edge, right. So we have a model which represents the relationships between things in the real world. It's attempting to evolve as close as possible to real time, in relationship to its real world twin, and it's reflecting its links and so on. But the notion of what you want to compute from it is separate from that, and decoupled. And so the second step, which is building an application right here, right now, is to drop in an operator which is going to compute a thing from that. So you might say, cool, I want every intersection to be able to learn from its own behavior and predict. That's one thing. We might say, I want to compute the average wait time of every car in the city. That's another thing. So the key point here is that computing from these rapidly evolving worldviews is decoupled from the actual model of what's going on in that world at a point in time. So Swim reflects that decoupling by allowing you to bind operators to the model whenever you want.
0:20:45
Okay,
0:20:46
By whenever you want, I mean you can write them in code, in bits of Java or whatever, but also you can write them in blobs of JavaScript or Python and dynamically insert them into a running model. Okay, so let me make that one concrete for you. I could have a deployed system, which is a deployed graph of digital twins which are currently mirroring the state of Las Vegas. And then dynamically, a data scientist says, let me compute the average wait time of red cars at these intersections, and drops that in as a blob of JavaScript attached to every digital twin for an intersection. That is what I mean by an application. And so we want to get to this point where the notion of an application is not something deeply hidden in somebody's, you know, Jupyter notebook, or in some programmer's brain before they quit and wandered off to the next startup ten months ago; an application is whatever anyone, right now, drops into a running model.
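Simon's "average wait time of red cars" example could be sketched as an operator like the one below: a small piece of code attached to running twins that maintains a streaming aggregate over their state updates. The operator shape here (a plain DoubleConsumer fed by a twin's state updates) is an assumption for illustration; Swim's actual mechanism for injecting JavaScript or Python operators into a running model is not shown.

```java
import java.util.function.DoubleConsumer;

// Hypothetical sketch: an "operator" dropped onto running digital twins.
// It consumes the twins' streaming state and maintains its own derived result.
public class AverageWaitOperator implements DoubleConsumer {

    private long count;
    private double mean; // running mean of observed wait times (seconds)

    @Override
    public void accept(double waitSeconds) {
        count++;
        mean += (waitSeconds - mean) / count; // incremental mean, no history stored
        System.out.printf("current average wait: %.1f s over %d cars%n", mean, count);
    }

    public static void main(String[] args) {
        AverageWaitOperator op = new AverageWaitOperator();
        // In a real deployment this operator would be linked to intersection twins;
        // here we just feed it a few simulated wait times for red cars.
        for (double wait : new double[] {12.0, 45.5, 30.2, 8.7}) {
            op.accept(wait);
        }
    }
}
```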
Tobias Macey
0:22:02
So the way that sounds to me is that Swim essentially acts as the infrastructure layer you deploy to ingest the data feeds from the sets of sensors, and then it will automatically create these digital twin objects to be able to have some digital manifestation of the real world, so that you have a continuous stream of data and how that's interrelated. And then it sort of flips the order of operations in terms of how the data engineer and the data scientist might work together. Where, the way that most people are used to, you will ingest the data from these different sensors, bundle it up, and then hand it off to a data scientist to be able to do their analyses, they generate a model, and then they hand it back to the data engineer to say, okay, go ahead and deploy this and then see what the outputs are. Where instead, the Swim platform essentially acts as the delivery mechanism and the interactive environment for the data scientists to be able to experiment with the data, build a model, and then get it deployed on top of the continuously updating live stream of data, and then be able to have some real world interaction with those sensors in real time as they're doing that, to be able to feed that back to say, okay, red cars are waiting 15% longer than other cars at these two intersections, and I want to be able to optimize our overall grid, and that will then feed back into the rest of the network to have some physical manifestation of the analysis that they're trying to perform, to try and maybe optimize all traffic.
Simon Crosby
0:23:39
So there are some consequences of that. First of all, every algorithm has to compute stuff on the fly. So if you look at, you know, the kind of store-and-then-analyze approach to big data type learning, or training, or anything else, you have it all there; here you don't. And so every algorithm that is part of Swim is coded in such a way as to continually process data. And that's fundamentally different to most frameworks. Okay, so for example,
0:24:19
the,
0:24:21
the learn-and-predict cycle is what, you know, you mentioned: training, and so on. And that's very interesting. But the notion of training implies that I collect and store some training data, and that it's complete and useful enough to train the model and then hand it back. You know, what if it isn't? And so in Swim, we don't do that. I mean, we can if you want; if you have a model, it's no problem for us to use that too. But instead, in Swim, the input vector, say to a prediction algorithm, say a DNN, is precisely the current state of the digital twins for some bunch of things, right? Maybe the set of sensors in the neighborhood of the urban intersection. And so this is a continuously varying, real world triggered scenario, in which real data is fed through the algorithm but is not stored anywhere. So everything is fundamentally streaming. So we assume that data streams continually, and indeed the output of every algorithm streams continually. So what you see when you compute the average is the current average, okay. What you see when you're looking for heavy hitters is the current heavy hitters. All right. And so every algorithm has its streaming twin, I guess. And part of the art in the Swim context is reformulating the notion of analysis into a streaming context, so that you never expect a complete answer, because there isn't one; it's just what I've seen until now. Okay, and what I've seen until now has been fed through the algorithm, and this is the current answer. And so every algorithm computes and streams. And so the notion of linking, which I described earlier for Swim between digital twins, say, applies also to these operators, which effectively link to the things they want to compute from, and then they stream their results. Okay, so if you link in, you see a continued update. And, for example, that stream could be used to feed a Kafka implementation, which would serve a bunch of applications; you know, the notion of streaming is pretty well understood, so we can feed other bits of the infrastructure very well. But fundamentally, everything is designed to stream.
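As a concrete example of "every algorithm has a streaming twin", here is a generic Misra-Gries style heavy hitters sketch in Java. It never sees a complete data set; at any moment its answer is simply the current heavy hitters observed so far. This is a textbook streaming algorithm, not code taken from Swim.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Generic Misra-Gries heavy-hitters sketch: a streaming algorithm whose output
// at any moment is "the current heavy hitters seen so far", never a final answer.
public class HeavyHitters {

    private final int capacity;                    // track at most this many candidates
    private final Map<String, Long> counters = new HashMap<>();

    public HeavyHitters(int capacity) {
        this.capacity = capacity;
    }

    public void offer(String item) {
        Long c = counters.get(item);
        if (c != null) {
            counters.put(item, c + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free slot: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1) it.remove(); else e.setValue(e.getValue() - 1);
            }
        }
    }

    public Map<String, Long> current() {
        return Map.copyOf(counters);
    }

    public static void main(String[] args) {
        HeavyHitters hh = new HeavyHitters(3);
        for (String carColor : new String[] {"red", "blue", "red", "green", "red", "blue"}) {
            hh.offer(carColor);
            System.out.println("current heavy hitters: " + hh.current());
        }
    }
}
```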
Tobias Macey
0:27:21
It's definitely an interesting approach to the overall workflow of how these analyses work. And one thing that I'm curious about is how data scientists and analysts have found working with this platform, in terms of the ways it differs from what they might be used to; you know, I'm interested in how data scientists view this.
Simon Crosby
0:27:45
to be honest, in general with surprise.
0:27:50
Our experience to date has been largely with people who don't know what the heck they're doing in terms of data science. So they're trying to run an oil rig more efficiently; they have, what, about 10,000 sensors, and they want to make sure this thing isn't going to blow up. Okay? So they tend to be heavily operationally focused folks. They're not data scientists; they never could afford one. And they don't understand the language of data science, or have the ability to build cloud based pipelines that you and I might be familiar with. So these are folks who effectively just want to do a better job, given this enormous stream of data they have. They believe they have something in the data, they don't know what that might be, but they're keen to go and see. Okay. And so those are the folks we spend most of our time with. I'll give you a funny example, if you'd like.
Tobias Macey
0:29:00
illustrative,
Simon Crosby
0:29:02
we work with a manufacturer of aircraft.
0:29:05
And they have a very large number of RFID tagged parts, and equipment too. And if you know anything about RFID, you know it's pretty useless stuff; it was built about 10 or 20 years ago. And what they were getting, from about 2,000 readers, was about 10,000 reads a second. And each one of these reads was simply being written into an Oracle database, and at the end of the day they'd try and reconcile the whole lot with whatever parts they have, and wherever each thing is, and so on. And the Swim solution to this is entirely different, and it gives you a good idea of why we care about modeling data, or thinking about data, differently. We simply built a digital twin for every tag: the first time it's seen, we create one, and if it hasn't been seen for a long time, it just expires. And whenever a reader sees a tag, it simply says, hey, I saw you, and this was the signal strength. Now, because tags get seen by multiple readers, each digital twin of a tag does the obvious thing: it triangulates from the readers. Okay, so it learns the attenuation in different parts of the plant. It's very simple; the word 'learn' there is rather a stretch, it's a pretty straightforward calculation, and then suddenly it can work out where it is in 3-space. So instead of an Oracle database full of tag reads, and lots and lots of post processing, you know, we've got a couple of Raspberry Pis, and each one of these Raspberry Pis has, you know, millions of these tags running, and then you can ask any one of them where it is. Okay, and then you can do even more: you can say, hey, show me all the things within three meters of this tag, okay? And that allows you to see components being put together into real physical objects, right? So as a fuselage gets built up, or the engine, or whatever it is. And so a problem which was tons of infrastructure and tons of tag reads got turned into a few Raspberry Pis with stuff which kind of self organized into a form which could feed real time visualization and controls around where bits of infrastructure were.
0:31:52
Okay. Now, that
0:31:54
was transformative for this outfit, which quite literally had no other way of tackling the problem.
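The tag location calculation Simon describes is, as he says, fairly straightforward. One common simple approach is an RSSI-weighted centroid over the readers that saw the tag; the sketch below illustrates that idea generically and is not the manufacturer's or Swim's actual code.

```java
import java.util.List;

// Generic sketch: estimate a tag's position from the readers that saw it,
// weighting each known reader position by received signal strength.
public class TagLocator {

    public record Observation(double readerX, double readerY, double readerZ, double rssi) {}
    public record Position(double x, double y, double z) {}

    // Weighted centroid: a stronger signal suggests the reader is closer, so it gets more weight.
    public static Position estimate(List<Observation> seenBy) {
        double wSum = 0, x = 0, y = 0, z = 0;
        for (Observation o : seenBy) {
            // RSSI is negative dBm; -45 dBm (strong) maps to a larger weight than -80 dBm (weak).
            double weight = Math.pow(10, o.rssi() / 20.0);
            wSum += weight;
            x += weight * o.readerX();
            y += weight * o.readerY();
            z += weight * o.readerZ();
        }
        return new Position(x / wSum, y / wSum, z / wSum);
    }

    public static void main(String[] args) {
        Position p = estimate(List.of(
            new Observation(0, 0, 3, -45),   // strong read: the tag is near this reader
            new Observation(10, 0, 3, -70),
            new Observation(0, 10, 3, -80)));
        System.out.printf("estimated tag position: (%.1f, %.1f, %.1f)%n", p.x(), p.y(), p.z());
    }
}
```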
Tobias Macey
0:32:02
Does that make sense? Yeah, that's definitely a very useful example of how this technology can flip the overall order of operations and just the overall capabilities of an organization to be able to answer useful questions. And the idea of going from, as you said, an Oracle database full of all these rows and rows of records of this tag, read at this point, in this location, to being able to actually get something meaningful out of it, as far as this part is in this location in the overall reference space of the warehouse, is definitely transformative, and probably gave them weeks or months worth of additional lead time for being able to predict problems or identify areas for potential optimization.
Simon Crosby
0:32:47
Yeah, I think we saved them $2 million a year. And let me tell you, from this tale come two interesting things. First of all, if you show up at a customer's site with a service running on a Raspberry Pi, you can't charge them a million bucks. Okay, that's lesson one. Lesson two is that the volume of the data is not related to the value of the insight. Okay. I mentioned traffic earlier. In the city of Las Vegas, we get about 15 or 16 terabytes per day from the traffic infrastructure. And every digital twin of every intersection in the city predicts two minutes into the future, okay. And those insights are sold through an API in Azure to customers like Audi and Uber and Lyft, and whoever else, okay. Now, that's a ton of data; you couldn't even think of where to put it in your cloud. But the value of the insight is relatively low; the total amount of money I can extract from Uber per month per intersection is low. All right, and by the way, all this stuff is open source; you can go grab it, and play, and hopefully make your city better. It's not a high enough value for me to do anything other than say, go grab it and run. So: vast amounts of data, and relatively important, but not commercially significant, value.
Tobias Macey
0:34:35
And another aspect of that case in particular is that, despite this volume of data, it might be interesting for being able to do historical analyses, but in terms of the actual real world utility, it has a distinct expiration period, where you have no real interest in the sensor data as it existed an hour ago, because that has no particular relevance to your current state of the world and what you're trying to do with it at this point in time.
Simon Crosby
0:35:03
Yeah, you have historical interest in the sense of wanting to know if your predictions were right, or wanting to know for traffic engineering purposes, which run on a slower time scale. So some form of bucketing, or whatever, some coarser form of recording, is useful, and sure, that's easy. But you certainly do not want to record it at the raw data rate.
Tobias Macey
0:35:30
And then going back to the other question I had earlier, when we were talking about the workflow of an analyst or a data scientist pushing out their analyses live to these digital twins and potentially having some real world impact: I'm curious if the Swim platform has some concept of a dry run mode, where you can deploy this analysis and see what the output of it is, and see maybe what impact it would have, without it actually manifesting in the real world, for cases where you want to ensure that you're not accidentally introducing error or potentially having a dangerous outcome, particularly in the case that you were mentioning of an oil and gas rig.
Simon Crosby
0:36:12
Yeah, so with maybe a 1% exception, everything we've done thus far has been open loop, in the sense that we're informing another human or another application, but we're not directly controlling the infrastructure. And the value of a dry run would be enormous, you can imagine, in those scenarios, but thus far we don't have any use cases that we can report of using Swim for direct control. We do have use cases where, on a second by second basis, we are predicting whether machines are going to make an error as they build PCBs for servers, and so on. But then again, what you're doing is you're calling for somebody to come over and fix the machine; you're not, you know, trying to change the way the machine behaves.
Tobias Macey
0:37:06
And now digging a bit deeper into the actual implementation of Swim, I'm wondering if you can talk through how the actual system itself is architected, and some of the ways that it has evolved as you have worked with different partners to deploy it into real world environments and get feedback from them, and how that has affected the overall direction of the product roadmap.
Simon Crosby
0:37:29
So Swim is a couple of megabytes of Java extensions, okay? So it's extremely lean. We tend to deploy in containers using the GraalVM. It's very small; we can run in, you know, probably 100 megabytes or so. And when people tend to think of edge, they tend to think of running in IoT gateways or things like that; we don't really think of edge that way. And so an important part of defining edge, as far as we're concerned, is simply gaining access to streaming data. We don't really care where it is, but we aim to be small enough to get onto limited amounts of compute towards the physical edge. And the, you know, the product has evolved in the sense that, originally, it was a way of building applications for the edge, and you'd sit down and write them in Java, and so on.
0:38:34
Latterly, this ability to simply let
0:38:39
the data build the app, or most of the app, came about in response
0:38:46
to customer needs.
0:38:49
But Swim is deployed typically in containers, and for that we have, in the current release, relied very heavily on the Azure IoT Edge framework. And that is magical, to be quite honest, because we can rely on Microsoft machinery to deal with all of the painful bits of deployment and lifecycle management for the code base and the application as it runs. These are not things we are really focused on; what we're trying to do is build a capability which will respond to data and do the right thing for the application developer. And so we are fully published in the Azure IoT Hub, and you can download this and get going and manage it through its lifecycle that way. And so in several use cases now, what we're doing is we are used to feed fast timescale insights at the physical edge, we are labeling data and then dropping it into Azure ADLS Gen2, and feeding insights into applications built in Power BI. Okay, so just for the sake of the machinery, you know, we're using the Azure framework for management of the IoT edge. By the way, I think IoT Edge is about the worst possible name you could ever pick, because all you want is a thing to manage the lifecycle of a capability which is going to deal with fast data; whether it's at the physical edge or not is immaterial. But that's basically what we've been doing: relying on Microsoft's fabulous lifecycle management framework for that, plugged into the IoT Hub, and all the Azure services generally, for back end things which enterprises love.
Tobias Macey
0:41:00
Then another element of what we're discussing, in the use case examples that you were describing, particularly, for instance, with the traffic intersections, is the idea of discoverability and routing between these digital twins, as far as how they identify the cardinality of which twins are useful to communicate with and establish those links; and also, at the networking layer, how they handle network failures in terms of communication, and ensure that if there is some sort of fault, they're able to recover from it.
Simon Crosby
0:41:36
Sure, let's talk about two layers. One is the app layer, and the other one is the infrastructure, which is going to run this, effectively, as a distributed graph.
0:41:45
And so Swim is going to build this graph for us
0:41:49
from the data. What that means is the digital twins, by the way, we technically call these web agents, these little web agents are going to be distributed over a fabric of physical instances, and they may be widely geographically
0:42:06
distributed. And
0:42:08
so there is a need, nonetheless, at the application layer, for things which are related in some way, linked physically or, you know, in some other way, to be able to link to each other; that is, to
0:42:23
have a sub. And so links
0:42:27
require that the objects, which are the digital twins, have the ability to inspect
0:42:33
each other's data,
0:42:34
right, their members. And of course, if something is running on the other side of the planet and you're linked to it, how on earth is that going to work? So, we're all familiar with object oriented languages and objects in one address space; that's pretty easy. We know what an
0:42:50
object handle or an object
0:42:51
reference or a pointer or whatever is; we get it. But when these things are distributed, that's hard. And so in Swim, if you're an application programmer, you will simply use object references, but these resolve to URIs. So in practice, at runtime, when I link to you, I link to your URI. And that link,
0:43:17
which is resolved by Swim,
0:43:19
enables a continuous stream of updates to flow from you to me. And if we happen to be on different instances, that is, running in different address spaces, then that flows over a mesh of, you know, direct WebSocket connections between your instance and mine. And so in any Swim deployment, all instances are interlinked; they each link to each other using a single WebSocket connection, and then these links permit the flow of information between linked digital twins. And whenever a change in the in-memory state of a linked, you know, digital twin happens, its instance then streams to every other linked object an update to the state of that thing, right. So what's required is, in effect, a streaming update to JSON. Because we're going to record our model in some form of, like, JSON state or whatever, we need to be able to update little bits of it as things change, and we use a protocol called WARP for that. And that's a Swim capability which we've open sourced. And what that really does is bring streaming to JSON, right, streaming updates to parts of a JSON model. And then every instance in Swim maintains its own view of the whole model. So as things stream in, the local view of the model changes. But the view of the world is very much one of a consistency model based on whatever happens to be executing locally and whatever it needs to view; an eventually consistent data model, where every node eventually learns the entire thing.
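The idea of streaming updates to parts of a JSON-like model can be illustrated with a small in-process sketch: a shared tree of values where each change is pushed to every linked subscriber as a path and value delta, rather than re-sending the whole document. The message shape here is an assumption for illustration and is not the WARP wire format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of streaming partial updates to a JSON-like model: each change is
// delivered to linked subscribers as a (path, value) delta, never the whole document.
public class StreamingModel {

    private final Map<String, Object> state = new LinkedHashMap<>();
    private final List<BiConsumer<String, Object>> links = new ArrayList<>();

    // A remote peer "links" by registering interest; it then receives every delta.
    public void link(BiConsumer<String, Object> subscriber) {
        links.add(subscriber);
    }

    // Update one part of the model and stream just that delta to all links.
    public void update(String path, Object value) {
        state.put(path, value);
        for (BiConsumer<String, Object> link : links) {
            link.accept(path, value);
        }
    }

    public static void main(String[] args) {
        StreamingModel intersection = new StreamingModel();
        // A linked peer (in Swim this would arrive over a WebSocket between instances).
        intersection.link((path, value) ->
            System.out.println("delta -> " + path + " = " + value));
        intersection.update("/intersection/42/signal", "RED");
        intersection.update("/intersection/42/waitingCars", 7);
    }
}
```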
Tobias Macey
0:45:22
And generally, eventually here means, you know, a link away from real time, right; so a link's delay away from real time. And then the other aspect of the platform is the statefulness of the computation. And as you're saying, that state is eventually consistent, dependent on the communication delay between the different nodes within the context graph. And then, in terms of data durability, one thing I'm curious about is the length of state, or sort of the overall buffer that is available; I'm guessing that's largely dependent on where it happens to be deployed and what the physical capabilities are of the particular node. And then also, as far as persisting that data for maybe historical analysis, my guess is that that relies on distributing the data to some other system for long term storage. I'm just wondering what the overall sort of pattern or paradigm is for people who want to be able to have that capability?
Simon Crosby
0:46:24
Oh, this is a great question. So in general, we move from some horrific raw data form on the wire, from the original physical thing, to, you know, something much more efficient and meaningful in memory, and generally much more concise, so we get a whole ton of data reduction immediately. And the system is focused on streaming; we don't stop you storing your original data if you want to, you might just throw it to disk or whatever, but the key thing in Swim is we don't do that on the hot path. Okay, so things change their state in memory, and maybe compute on that, and that's what they do first and foremost, and then we lazily throw things to disk, because disk happens slowly relative to compute. And so typically, what we end up storing is the semantic state of the context graph, as you put it, not the original data.
0:47:23
That is, for example, in traffic world,
0:47:26
you know, we store things like 'this light turned red at this particular time', not the voltage on all the registers in the light, and so we get massive data reduction. And that form of data is very amenable to storage in the cloud, say, or somewhere else. And it's even affordable at reasonable rates.
0:47:50
So the key thing for Swim and storage is
0:47:53
you're going to remember as much as you want, as much as you have space for, locally. And then storage in general is not on the hot path; it's not on the compute-and-stream path. And in general we're getting huge data reductions for every step up the graph we make. So for example, if I go from, you know, all the states of all the traffic sensors to predictions, then I've made a very substantial reduction in the data I need to remember anyway, right. So as you move up this computational graph, you reduce the amount of data you're going to have to store. And it's up to you, really, to pick what you want to keep.
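The point about storing compact semantic state off the hot path can be sketched like this: the twin reduces raw readings to a semantic event, such as "light 42 turned red at time T", and hands it to a background writer, so disk latency never blocks the streaming computation. The queue-and-background-thread arrangement here is just one assumed way of keeping persistence off the hot path, not Swim's actual storage implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: keep persistence off the hot path by queueing compact semantic events
// ("light 42 turned RED at time t") and writing them lazily on a background thread.
public class LazySemanticStore {

    public record SemanticEvent(String subject, String newState, long timestampMillis) {}

    private final BlockingQueue<SemanticEvent> queue = new LinkedBlockingQueue<>();

    public LazySemanticStore() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    SemanticEvent e = queue.take();
                    // Stand-in for an append to disk or cloud storage; stays off the hot path.
                    System.out.println("persisted: " + e);
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }, "lazy-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called from the streaming hot path: O(1), never blocks on the disk.
    public void record(SemanticEvent event) {
        queue.offer(event);
    }

    public static void main(String[] args) throws InterruptedException {
        LazySemanticStore store = new LazySemanticStore();
        // The twin stores "the light turned red", not the raw voltages behind it.
        store.record(new SemanticEvent("light-42", "RED", System.currentTimeMillis()));
        Thread.sleep(100); // give the daemon writer a moment before the demo exits
    }
}
```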
Tobias Macey
0:48:39
in terms of your overall experience, working as the CTO of this organization and shepherding the product direction and the capabilities of this system, I'm wondering what you have found to be some of the most challenging aspects, both from the technical and business sides, and some of the most useful or interesting or unexpected lessons that you've learned in the process.
Simon Crosby
0:49:03
So what's hard is that the real world is not the cloud native world. We've all seen fabulous examples of Netflix and Amazon and everybody else doing cool things with the data they have. But you know, if you're an oil company and you have a rig at sea, you just don't know how to do this. So, you know, we can come at this with whatever skill sets we have; what we find is that the real world large enterprises of today are still ages behind the cloud native folks. And that's a challenge. Okay, so getting to understand what they need, because they still have lots of assets which are generating tons of data, is very hard. Second, this notion of edge is continually confusing. And I mentioned previously that I would never have chosen IoT Edge, for example, as the Azure name, because it's not about IoT, or maybe it is. But let me give you two examples. One is traffic lights, say, physical things, where it's pretty obvious that the notion of edge is the physical edge. But the other one is this: we build a real time model for tens of millions of handsets for a large mobile carrier, in memory, and they evolve all the time, right, in response to continually received signals from these devices,
0:50:38
there is no edge;
0:50:40
that is, it's data that arrives over the internet. And we have to figure out where the digital twin for that thing is, and evolve it in real time. Okay, and there, you know, there is no concept of a network edge or physical edge that the data is traveling over. We just have to make decisions on the fly and learn and update the model.
0:51:06
So for me, edge is the following thing: edge is stateful.
0:51:13
And
0:51:15
cloud is all about REST. Okay, so what I'd say is the fundamental difference between the notion of edge and the notion of cloud that I would like to see broadly understood is that, whereas REST and databases made the cloud very successful, in order to be successful with, you know, this boundless streaming data, statefulness is fundamental, which means REST goes out the door. And we have to move to a model which is streaming based, with stateful computation.
Tobias Macey
0:51:50
And then in terms of the future direction, both from the technical and business perspective, I'm wondering what you have planned, both for the enterprise product for Swim.ai, as well as the open source kernel in the form of SwimOS.
Simon Crosby
0:52:06
From an open source perspective, we,
0:52:08
you know, we don't have the advantage of having come up at LinkedIn or something, where it was built and used in house at scale before coming out as a startup. But we think what we've built is something which is of phenomenal value, and we're seeing that grow. And our intention is to continually feed the community as much as it can take. And we're just getting more and more stuff ready for open sourcing.
0:52:36
So we want to see our community
0:52:40
go and explore new use cases for using this stuff, and we are totally dedicated to empowering our community. From a commercial perspective, we are focused on our world, which is edge; and the moment you say that, people tend to get an idea of the physical edge or something in their heads, and then, you know, very quickly you can get put in a bucket of IoT. I gave an example of, say, building a model in real time in AWS for, you know, a mobile carrier. Our intention is to continue to push the bounds of what edge means, and to enable people to build stream pipelines for massive amounts of data easily, without complexity, and without the skill set required to invest in these traditionally fairly heavyweight pipeline components such as Beam and Flink and so on,
0:53:46
0:53:47
to enable people to get insights cheaply, and to make the problem of dealing
0:53:51
with new insights from data very easy to solve.
Tobias Macey
0:53:56
And are there any other aspects of your work on Swim, or the space of streaming data and digital twins, that we didn't discuss yet that you'd like to cover before we close out the show?
Simon Crosby
0:54:08
I think we've done a pretty good job. You know, I think there are a bunch of parallel efforts, and that's all goodness. One of the hardest things has been to get this notion of statefulness more broadly accepted. And I see the functions-as-a-service vendors out there pushing their idea of stateful functions as a service, and really, these are stateful actors. And there are others out there too. So for me, step number one is to get people to realize that if we're going to take on this data, REST and databases are going to kill us, okay? That is, there is so much data, and the rates are so high, that you simply cannot afford to use a stateless paradigm for processing; you have to do it statefully. Because, you know, forgetting context every time and then looking it up is just too expensive.
Tobias Macey
0:55:08
For anybody who wants to follow along with you and get in touch and keep track of what you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Simon Crosby
0:55:26
Well, I think, I mean, there isn't much tooling, to be perfectly honest. There are a bunch of really fabulous open source code bases, and experts in their use, but that's far from tooling. And then there is, I guess, an extension of Power BI downwards, which is something like the monster Excel spreadsheet world, right? So you find all these folks who are pushing that kind of, you know, end user model of data, doing great things, but leaving a huge gap between the consumer of the insight and the data itself; it assumes the data is already there in some good form and can be put into a spreadsheet or view or whatever it happens to be. So there's this huge gap in the middle, which is: how do we build the model? What does the model tell us, just off the bat? How do we do this constructively in large numbers of situations? And then how do we dynamically insert operators which are going to compute useful things for us on the fly into running models?
Tobias Macey
0:56:44
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on the swim platform. It's definitely a very interesting approach to data management and analytics, and I look forward to seeing the direction that you take it in the future. So I appreciate your time on that. I hope you enjoy the rest of your day.
Simon Crosby
0:57:01
Thanks very much. It's been great.
Tobias Macey
0:57:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email [email protected] with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building A Reliable And Performant Router For Observability Data - Episode 97

Summary

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what the Vector project is and your reason for creating it?
    • What are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?
  • What strategy are you using for project governance and sustainability?
  • What are the main use cases that Vector enables?
  • Can you explain how Vector is implemented and how the system design has evolved since you began working on it?
    • How did your experience building the business and products for Timber influence and inform your work on Vector?
    • When you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?
    • What led you to choose Lua as the embedded scripting environment?
  • What data format does Vector use internally?
    • Is there any support for defining and enforcing schemas?
      • In the event of a malformed message is there any capacity for a dead letter queue?
  • What are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?
  • When designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?
  • What options are available to operators to support visibility into the running system?
  • In terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?
  • What are some of the other considerations that operators and administrators of Vector should be considering?
  • You have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?
  • What is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)
  • What are some of the most interesting/innovative/unexpected ways that you have seen Vector used?
  • What are some of the challenges that you have faced in building/publicizing Vector?
  • For someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?
    • What is missing that you would consider necessary for production readiness?
  • When is Vector the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open source observability data router. So Ben, can you start by introducing yourself?
Ben Johnson
0:01:47
Sure. My name is Ben. I am the co-founder and CTO of Timber.io.
Tobias Macey
0:01:53
And Luke, How about yourself?
Luke Steensen
0:01:54
Yeah, I'm Luke Steensen. I'm an engineer at Timber.
Tobias Macey
0:01:58
And Ben, going back to you. Do you remember how you first got involved in the area of data management?
Ben Johnson
0:02:02
Yeah. So I mean, just being an engineer, obviously, you get involved in it through observing your systems. And so before we started Timber, I was an engineer at SeatGeek, and we dealt with all kinds of observability challenges there.
Tobias Macey
0:02:16
And Luke, do you remember how you first got involved in the area of data management?
Luke Steensen
0:02:20
Yeah, so at my last job, I ended up working with Kafka quite a bit in a few different contexts. So I ended up getting pretty involved with that project, leading some of our internal stream processing projects that we were trying to get off the ground, and I just found it, you know, a very interesting space. And the more that you dig into a lot of different engineering problems, it ends up boiling down to managing your data. Especially when you have a lot of it, that kind of becomes the primary challenge and limits a lot of what you're able to do. So the more tools and techniques you have to address those issues, and the more you can use those as design tools, the further you can get, I think.
Tobias Macey
0:03:09
And so you both work at Timber.io, and you have begun work on this Vector project. So I'm wondering if you can explain a bit about what it is and the overall reason that you had for creating it in the first place?
Ben Johnson
0:03:21
Yeah, sure. So at the most basic sense, Vector is an observability data router. It collects data from anywhere in your infrastructure, whether that be a log file, a TCP socket, or StatsD metrics, and Vector is designed to ingest that data and then send it to multiple destinations. And so the idea is that it is sort of vendor agnostic, and collects data from many sources and sends it to many things. And the reason we created it was really for a number of reasons, I would say. One, you know, being an observability company — when we initially launched Timber, it was a hosted logging platform, and we needed a way to collect our customers' data. We tried writing our own initially in Go that was very specific to our platform, and that was very difficult. We started using off-the-shelf solutions and found those to also be difficult: we were getting a lot of support requests, and it was hard for us to contribute to and debug them. And then I think in general, you know, our ethos as a company is we want to create a world where developers have choice, are not locked into specific technologies, are able to move with the times, and can choose best-in-class tools for the job. And that's kind of what prompted us to start Vector. That vision, I think, is enabled by an open collector that is vendor agnostic and meets a quality standard that makes people want to use it. So it looks like we have other areas in this podcast where we'll get into some of the details there. But we really wanted to raise the bar on the open collectors and start to give control and ownership back to the people, the developers, that were deploying Vector.
Tobias Macey
0:05:14
And as you mentioned, there are a number of other off-the-shelf solutions that are available. Personally, I've had a lot of experience with Fluentd, and I know that there are other systems coming from the Elastic stack and other areas. I'm curious, what are some of the tools that you consider to be comparable or operating in the same space? And of the ones that you've had experience with, what did you find to be lacking, and what were the particular areas that you felt needed to be addressed that weren't being handled by those other solutions?
Ben Johnson
0:05:45
Yeah, I think that's a really good question. So first, I would probably classify the collectors as either open or not. And so typically, we're not too concerned with vendor-specific collectors, like the Splunk forwarder, or any other sort of, you know, thing that just ships data to one service. So I'd say that, in the category of comparing tools, I'll focus on the ones that are open, like you said: Fluentd, Filebeat, Logstash — though I think it's questionable whether they're completely open. But I think we're more comparable to those tools. And then I'd also say that I don't want to say anything negative about those projects, because a lot of them were pieces of inspiration for us. And so, you know, we respect the fact that they are open, and they were solving a problem at the time. But I'd say one of the projects that we thought was a great alternative and inspired us is one called Cernan. It was built by Postmates, and it's also written in Rust. And that kind of opened our eyes a little bit to a new bar, a new standard that you could set with these collectors. And we actually know Brian Trautwein, who was one of the developers that worked on it. He's been really friendly and helpful to us. But the reason we didn't use Cernan is, one, it was created out of necessity at Postmates, and it doesn't seem to be actively maintained. And so that's one of the big reasons we started Vector. And so I would say that's something that's lacking: a project that is actively maintained and is in it for the long term. Obviously, that's important. And then in terms of actual specifics of these projects, there's a lot that I could say here. But on one hand, we've seen a trend of certain tools that are built for a very specific storage, and then sort of backed into supporting more sinks. And it seems like the incentives and sort of fundamental practices of those tools are not aligned with many disparate storages that ingest data differently. For example, the fundamentals of batching and stream processing — I think those two ways of collecting data and sending it downstream don't work for every single storage that you want to support. The other thing is just the obvious ones, like performance, reliability, having no dependencies. You know, if you're not a strong Java shop, you probably aren't comfortable deploying something like Logstash and then managing the JVM and everything associated with that. And, yeah, I think another thing is we wanted a collector that was fully vendor agnostic and vendor neutral, and a lot of them don't necessarily fall into that bucket. And as I said before, that's something we really strongly believe in — an observability world where developers can rely on a best-in-class tool that is not biased and has aligned incentives with the people using it, because there's just so many benefits that stem off of that.
Tobias Macey
0:08:51
And on the point of sustainability and openness, I'm curious — since you are part of a company, and this is in some ways related to the product offering that you have — how you're approaching issues such as project governance and sustainability, and ensuring that the overall direction of the project remains impartial and open, and trying to foster a community around it so that it's not entirely reliant on the direction that you try to take it internally, and so that you're incorporating input from other people who are trying to use it for their specific use cases.
Ben Johnson
0:09:28
Yeah, I think that's a great question.
0:09:31
So one is we want to be totally transparent on everything. Like, everything we do with Vector — discussions, specifications, roadmap planning — it's all available on GitHub. And so nothing is private there, and we want Vector to truly be an open project that anyone can contribute to. And then, in terms of governance and sustainability, we try to do a really good job just maintaining the project. So number one is good issue management: making sure that that's done properly helps people search for issues, helps them understand which issues need help, and what are good first issues to start contributing on. We wrote an entire contribution guide and actually spent good time and put some serious thought into that, so that people understand what the principles of Vector are and how to get started. And then I think the other thing that really sets Vector apart is the documentation. And I think that's actually very important for sustainability, and it's really kind of a reflection of your project's respect for the users, in a way. But it also serves as a really good opportunity to explain the project and help people understand the internals of the project, and how to contribute to it. So it really all comes together. But I'd say, yeah, the number one thing is just transparency, and making sure everything we do is out in the open.
Tobias Macey
0:11:00
And then in terms of the use cases that Vector enables — obviously, one of them is just being able to process logs from a source to a destination. But in the broader sense, what are some of the ways that Vector is being used, both at Timber and with other users and organizations that have started experimenting with it, beyond just the simple case?
Ben Johnson
0:11:20
So first, Vector is new, so we're still learning a lot as we go. But, you know, the core business use cases we see — there's everything from reducing cost: Vector is quite a bit more efficient than most collectors out there, so just by deploying Vector, you're going to be using fewer CPU cycles and less memory, and you'll have more of that available for the app that's running on that server. The flip side of that is the fact that Vector enables choosing multiple storages, and choosing the storage that is best for your use case lets you reduce costs as well. So for example, if you're running an ELK stack, you don't necessarily want to use it for archiving — you can use another, cheaper durable storage for that purpose and sort of take that responsibility out of your ELK stack. And that reduces costs in that way. So I think that's an interesting way to use Vector. Another one is, like I said before, reducing lock-in. That use case is so powerful, because it gives you agility, choice, control, and ownership of your data. Transitioning vendors is a big use case; so many companies we talk to are bogged down and locked in to vendors, and they're tired of paying the bill, but they don't see a clean way out. And observability is an interesting problem, because it's not just technology coupling — there are human workflows that are coupled with the systems you're using. And so transitioning out of something that maybe isn't working for your organization anymore requires a bridge. And Vector is a really great way to do that: deploy Vector, continue sending to whatever vendor you're using, and then you can, at the same time, start to try out other storages and other setups without disrupting the human workflows in your organization. And I can keep going — there's data governance, where we've seen people, you know, cleaning up their data and enforcing schemas; security and compliance, where you have the ability to scrub sensitive data at the source before it even goes downstream. And so, again, having a good open tool like this is so incredibly powerful because of all of those use cases that it enables, and it lets you take advantage of those when you're ready.
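To make the scrubbing use case Ben describes concrete, here is a minimal sketch of a Vector transform that drops sensitive fields at the edge before anything is forwarded. The field names and the `app_logs` source ID are hypothetical, and exact option names may differ between Vector versions, so treat this as illustrative rather than a reference configuration.

```toml
# Drop sensitive fields at the edge, before events leave the host.
# Assumes a source with the ID "app_logs" is defined elsewhere in the config.
[transforms.scrub_pii]
  type   = "remove_fields"
  inputs = ["app_logs"]
  fields = ["credit_card", "ssn", "password"]   # hypothetical field names
```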
Tobias Macey
0:13:36
In terms of the actual implementation of the project, you've already mentioned in passing that it was written in Rust. And I'm wondering if you can dig into the overall system architecture and implementation of the project, and some of the ways that it has evolved since you first began working on it.
Luke Steensen
0:13:53
Like you said, it's written in Rust — I mean, that's kind of the first thing everybody looks at and finds interesting.
0:13:57
And kind of on top of that, we're building with the Tokio asynchronous I/O stack of libraries and tools within the Rust ecosystem. From the beginning, we've started Vector pretty simple, architecturally. We have an eye on where we'd like to be, but we're trying to get there very incrementally. So at a high level, each of the internal components of Vector is generally either a source, a transform, or a sink. Probably familiar terms if you've dealt with this type of tool before, but sources are something that helps you ingest data, and transforms are different things you can do, like parsing JSON data into our internal data format, doing regular expression value extraction, like Ben mentioned, enforcing schemas — all kinds of stuff like that. And then sinks, obviously, which is where we will actually forward that data downstream to some external storage system or service or something like that. So those are the high-level pieces. We have some different patterns around each of those, and obviously there are different flavors — you know, if you have a UDP syslog source, that's going to look and operate a lot differently than a file-tailing source. So there are a lot of different styles of implementation, but we fit them all into those three buckets of source, transform, and sink. And then in the way that you configure Vector, you're basically building a data flow graph, where data comes in through a source, flows through any number of transforms, and then down the graph into a sink or multiple sinks. We try to keep it as flexible as possible, so you can pretty much build an arbitrary graph of data flow. Obviously, there are going to be situations where you could build something that's pathological or won't perform well, but we've leaned towards giving users the flexibility to do what they want. So if you want to, you know, parse something as JSON, and then use a regex to extract something out of one of those fields, and then enforce a schema and drop some fields, you can chain all these things together. And you can have them fan out into different transforms and merge back together into a single sink, or feed two sinks from the same transform's output — all that kind of stuff. So basically, we try to keep it very flexible. We definitely don't advertise ourselves as a general-purpose stream processor, but there's a lot of influence from working with those kinds of systems that has found its way into the design of Vector.
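As a concrete illustration of the source → transform → sink flow Luke describes, here is a minimal sketch of a Vector configuration. The component IDs are made up, and the option names are taken from Vector's documentation of roughly this era, so double-check them against the version you are running.

```toml
# A minimal Vector pipeline: tail a log file, parse each line as JSON,
# and forward the structured events to Elasticsearch.

[sources.app_logs]            # source: where data enters
  type    = "file"
  include = ["/var/log/app/*.log"]

[transforms.parse_json]       # transform: raw line -> structured event
  type   = "json_parser"
  inputs = ["app_logs"]       # draws from the source above

[sinks.search]                # sink: where data leaves
  type   = "elasticsearch"
  inputs = ["parse_json"]
  host   = "http://localhost:9200"
```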
Tobias Macey
0:17:09
Yeah, the ability to map together different components of the overall flow is definitely useful. I've been using Fluentd for a while, which has some measure of that capability, but it's also somewhat constrained in that the logic of the actual pipeline flow is dependent on the order of specification in the configuration document, which makes it sometimes difficult to understand exactly how to structure the document to make sure that everything is functioning properly. And there are some mechanisms for being able to route things slightly out of band with particular syntax, but just managing it has gotten to be somewhat complex. So when I was looking through the documentation for Vector, I appreciated the option of being able to simply say that the input to one of the steps is linked to the ID of one of the previous steps, so that you're not necessarily constrained by order of definition, and you can instead just use the ID references to ensure that the flows are what you intend.
Luke Steensen
0:18:12
Yeah, that was definitely something that we spent a lot of time thinking about, and we still spend a lot of time thinking about. Because, you know, if you kind of squint at these config files, they're kind of like a program that you're writing — you have data inputs and processing steps and data outputs. So you want to make sure that that flow is clear to people, and you also want to make sure that there aren't going to be any surprises. I know a lot of tools, like you mentioned, have this as kind of an implicit part of the way the config is written, which can be difficult to manage. We wanted to make it as explicit as possible, but also in a way that is still relatively readable when you open up the config file. We've gone with a pretty simple TOML format, and then, like you mentioned, you just specify which input each component should draw from. We have had some ideas and discussions about what our own configuration file format would look like. What we would love to do eventually is make these kinds of pipelines as pleasant to write as something like a bash pipeline, which is another really powerful inspiration for us. Obviously, they have their limitations, but the things that you can do just in a bash pipeline — you have a log file, you grep things out, you run it through something — there's all kinds of really cool stuff that you can do in a lightweight way. And that's something that we've put a little thought into: how can we be as close to that level of power and flexibility, while avoiding a lot of the limitations of, you know, obviously being a single tool on a single machine. And I don't want to get into all the gotchas that come along with writing bash one-liners — obviously there are a lot — but it's something that we want to take as much of the good parts from as possible.
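Because components reference their inputs by ID, the order in which they appear in the TOML file does not matter, and one component can feed several others. Here is a sketch of the fan-out being described, with hypothetical IDs and an assumed S3 bucket name; option names should be checked against your Vector version.

```toml
# The transform is defined last but referenced by ID above it; Vector builds
# the data flow graph from the `inputs` references, not from file order.

[sinks.archive]                 # fan-out target 1: cheap long-term storage
  type   = "aws_s3"
  inputs = ["scrub"]
  bucket = "my-log-archive"     # hypothetical bucket
  region = "us-east-1"

[sinks.console_debug]           # fan-out target 2: print to stdout while testing
  type     = "console"
  inputs   = ["scrub"]
  encoding = "json"

[transforms.scrub]
  type   = "regex_parser"
  inputs = ["app_logs"]         # the file source from the earlier sketch
  regex  = '^(?P<timestamp>\S+) (?P<level>\S+) (?P<message>.*)$'
```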
Tobias Macey
0:20:33
And then in terms of your decision process for the actual runtime implementation — for both the engine itself, as well as the scripting layer that you implemented in Lua — what was the decision process that went into choosing and settling on Rust, and what were the overall considerations and requirements that you had as part of that decision?
Luke Steensen
0:20:57
So from a high level, the things that we thought were most important when writing this tool — which is obviously going to run on other people's machines, and hopefully run on a lot of other people's machines — we want to be, you know, respectful of the fact that they're willing to put our tool on a lot of their boxes. So we don't want to use a lot of memory, we don't want to use a lot of CPU, we want to be as resource-constrained as possible. So efficiency was a big point for us, which Rust obviously gives you the ability to achieve. I'm a big fan of Rust, so I could probably talk for a long time about all the wonderful features and things. But honestly, the fact that it's a tool that lets you write very efficient programs and control your memory use pretty tightly — that's somewhere that we have a pretty big advantage over a lot of other tools. And then, I was the first engineer on the project, and I know Rust quite well, so just the human aspect of it made sense for us. We're lucky to have a couple of people at Timber who are very, very good with Rust, very familiar and involved in the community. So it has worked out very well, I think. From the embedded scripting perspective, Lua was kind of an obvious first choice for us. There's very good precedent for using Lua in this manner — for example, in nginx and HAProxy. They both have Lua environments that let you do a lot of amazing things that you would maybe never expect to be able to do with those tools: you can write a little bit of Lua, and there you go, you're all set. So Lua is very much built for this purpose — it's designed as an embedded language — and there's a mature implementation of bindings for Rust. So it didn't take a lot of work to integrate Lua, and we have a lot of confidence that it's a reasonably performant, reliable thing that we can drop in and expect to work. That being said, it's definitely not the end-all be-all. While people can be familiar with Lua from a lot of different areas where it's used, like game development, and, like I mentioned, some observability or infrastructure tools, we are interested in supporting more than just Lua. We actually have a work-in-progress JavaScript transform that will allow people to use that as an alternative engine for transformations. And then, a little bit longer term — we kind of want this area to mature a little bit before we dive in — the WebAssembly space has been super interesting, and I think that, particularly from a flexibility and performance perspective, it could give us a platform to do some really interesting things in the future.
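For a sense of what the embedded scripting looks like in practice, here is a sketch of a Lua transform with the script inlined in the config. The global `event` table shown here reflects the transform's original scripting API; later releases moved to a hook-based API, so consider this illustrative only.

```toml
# Enrich and normalize each event with a few lines of embedded Lua.
[transforms.enrich]
  type   = "lua"
  inputs = ["parse_json"]       # the JSON parser from the earlier sketch
  source = """
    -- add a static field and lowercase the level if it exists
    event["environment"] = "production"
    if event["level"] ~= nil then
      event["level"] = string.lower(event["level"])
    end
  """
```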
Tobias Macey
0:24:06
Yeah, I definitely think that the WebAssembly area is an interesting space to keep an eye on, because of the fact that it is, in some ways, being targeted as sort of a universal runtime that multiple different languages can target. And in terms of your choice of Rust, another benefit that it has, when you're discussing memory efficiency, is the guaranteed memory safety, which is certainly important when you're running it in customer environments, because that way you're less likely to have memory leaks or accidentally crash their servers because of a bug in your implementation. So I definitely think that that's a good choice as well. And then one other aspect of the choice of Rust for the implementation language that I'm curious about is how that has impacted the overall uptake of users who are looking to contribute to the project — either because they're interested in learning Rust, but also in terms of people who aren't necessarily familiar with Rust — and any barriers that that may pose.
Luke Steensen
0:25:02
It's something that's hard to know, because obviously we can't inspect the alternate timeline where we wrote it in Go or something like that. I would say that there are ups and downs from a lot of different perspectives. From a developer interest perspective, I think Rust is something that a lot of people find interesting — the sales pitch is a good one, and a lot of people find it compelling. So I think it's definitely caught a few more people's interest because it happens to be written in Rust. We try not to push on that too hard, because of course there's the other set of people who do not like Rust and are very tired of hearing about it. So, you know, we love it, and we're very happy with it, but we try not to make it a primary marketing point or anything like that. But I think it does help to some degree. And then from a contributions perspective — again, it's hard to say for sure, but I do know from experience that we have had a handful of people pop up from the open source community and give us some really high-quality contributions. And we've been really happy with that. Like I said, we can't really know how that would compare to if we had written it in a language that more people are proficient in, but the contributions from the community that we have seen so far have been really high quality, and we're really happy with them. The JavaScript transform that I mentioned is actually a good example of that — we had a contributor come in and do a ton of really great work to make that a reality, and it's something that we're pretty close to being able to merge and ship. So I definitely shared a little bit of that concern — I know Rust at least has the reputation of being a more difficult-to-learn language — but the community is there. There are a lot of really skilled developers that are interested in Rust and would love to have an open source project like Vector that they can contribute to, and we've definitely seen a lot of benefit from that.
Tobias Macey
0:27:12
In terms of the internals of Vector, I'm curious how the data itself is represented once it is ingested into the system, and how you process it through the transforms — as far as whether there's a particular data format that you use internally in memory, and also any capability for schema enforcement as it flows through Vector out to the sinks.
Luke Steensen
0:27:39
Yeah, so right now we have our own internal, in-memory data format. It's a little bit — I don't want to say thrown together — but it's something that's been incrementally evolving pretty rapidly as we build up the number of different sources and things that we support. This was actually something that we deliberately set out to do when we were building Vector: we didn't want to start with the data model. You know, there are some projects that do that, and I think there's definitely a space for that — the data modeling in the observability space is not always the best — but we explicitly wanted to leave that to other people. We were going to start with the simplest possible thing and then add features as we found that we needed them, in order to better support the data models of the downstream sinks and the transforms that we want to be able to do. So from day one, the data model was basically just a string: you send us a log message, and we represent it as a string of characters. Obviously, it's grown a lot since then. We call them events internally — it's kind of our vague name for everything that flows through the system. Events can be a log, or they can be a metric. For a metric, we support a number of different types, including counters and gauges — all your standard types of metrics from the StatsD family of tools. And then logs can be just a message — like I said, just a string; we still support that as much as we ever have — but we also support more structured data. So right now, it's a flat map of string to something, and we have a variety of different types that the values can be. That's also something that's growing as we want to better support different tools. So right now it's kind of a non-nested, JSON-ish representation. In memory, we don't actually serialize it to JSON — we support a few extra types, like timestamps and things like that, that are important for our use case — but in general, that's how you can think about it. We have a Protocol Buffers schema for that data format that we use when we serialize to disk, when we do some of our on-disk buffering, but I wouldn't say that's necessarily the primary representation. When you work with it in a transform, you're looking at that in-memory representation that, like I said, looks a lot like JSON. And that's something that we're constantly reevaluating and thinking about how we want to evolve. I think the next step in that evolution is to make it not just a flattened map and move it towards supporting nested maps and nested keys, so it's going to move more towards an actual, you know, full JSON-like structure, with better types and support for things like that.
Tobias Macey
0:30:39
And on the reliability front, you mentioned briefly the idea of disk buffering, and that's definitely something that is necessary for the case where you need to restart the service and you don't want to lose messages that have been sent to an aggregator node, for instance. I'm curious, what are some of the overall capabilities in Vector that support this reliability objective? And also, in terms of things such as malformed messages, if you're trying to enforce a schema, is there any way of putting those into a dead letter queue for reprocessing, or anything along those lines?
Luke Steensen
0:31:14
Yeah, a dead letter queue specifically isn't something that we support at the moment. It is something that we've been thinking about, and we do want to come up with a good way to support it, but currently that isn't something that we have. A lot of transforms, like the schema enforcement transform, will end up just dropping the events that don't conform — or, if it can't enforce that they meet the schema by dropping fields, for example, it will just drop the event — and we recognize the shortcomings there. I think one of the things that is a little bit nice, from an implementation perspective, about working in the observability space, as opposed to the normal data streaming world with application data, is that there's more of an expectation of best effort. That's something that we're willing to take advantage of a little bit in the very beginning, early stages of a certain feature or tool, but it's definitely a part of the ecosystem that we want to push forward. So that's something that we try to keep in mind as we build all this stuff: yes, it might be okay for now, and we may have parity with other tools, for example, if we just drop messages that don't meet a certain schema, but how can we do better than that? As for other kinds of things in the toolbox that you can reach for, for this type of thing — the most basic one would be that you can send data to multiple sinks. So if you have a classic syslog-like setup where you're forwarding logs around, it's super simple to just add a secondary output that will forward to both syslog aggregator A and syslog aggregator B. That's nothing particularly groundbreaking, but it's a start. Beyond that, I mentioned the disk buffer, where we want to do as good a job as we can ensuring that we don't lose your data once you have sent it to us. We are still a single-node tool at this point — we're not a distributed storage system — so there are going to be some inherent limitations in the guarantees that we can provide you there. If you really want to make sure that you're not losing any data at all, Vector is not going to be able to give you the guarantees that something like Kafka would. So we want to make sure that we work well with tools like Kafka that are going to give you pretty solid, redundant, reliable, distributed storage guarantees. Other than those two, writing the tool in Rust is, you know, kind of an indirect way that we try to make it as reliable as possible. I think Rust has a little bit of a reputation for making it tricky to do things — the compiler is very picky and wants to make sure that everything you're doing is safe — and that's something that you can definitely take advantage of to guide you in writing, as you mentioned, memory-safe code. But it expands beyond that into ensuring that every error that pops up is something you're going to be handling explicitly, at that level or a level above, and things like that. It guides you into writing more reliable code by default. Obviously, it's still on you to make sure that you're covering all the cases, but it definitely helps.
And then moving forward, we're going to spend a lot of time in the very near future setting up certain kinds of internal torture environments, if you will, where we can run Vector for long periods of time and induce certain failures in the network and, you know, the upstream services — maybe delete some data from underneath it on disk, and that kind of thing. If you're familiar with the Jepsen suite of database testing tools — obviously, we don't have quite the same types of invariants that an actual database would have, but I think we can use a lot of those techniques to stress Vector and see how it responds. And like I said, we're going to be limited in what we can do, based on the fact that we're a single-node system, and if you're sending us data over UDP, there aren't a ton of guarantees that we're going to be able to give you. But to the degree that we're able to give guarantees, we very much would like to do that. So that's definitely a focus of ours, and we're going to be exploring that as much as possible.
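Here is a sketch of the on-disk buffering Luke mentions: each sink can opt into a disk-backed buffer so queued events survive a restart, and the behavior when the buffer fills is configurable. The option names and size shown are approximate and should be checked against the documentation for your Vector version.

```toml
# The Elasticsearch sink from the earlier sketch, now with a disk-backed buffer.
[sinks.search]
  type   = "elasticsearch"
  inputs = ["parse_json"]
  host   = "http://localhost:9200"

  [sinks.search.buffer]
    type      = "disk"
    max_size  = 104857600     # ~100 MiB of buffered events on disk
    when_full = "block"       # apply back-pressure instead of dropping events
```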
Tobias Macey
0:36:03
And then, in terms of the deployment topologies that are available, you mentioned one situation where you're forwarding to a Kafka topic. But I'm curious what other options there are for ensuring high availability, and just the overall uptime of the system, for being able to deliver messages or events or data from the source to the various destinations.
Luke Steensen
0:36:28
Yeah, there are a few different general topology patterns that we've documented and that we recommend to people. The simplest one, depending on how your infrastructure is set up, can just be to run Vector on each of your application servers, or whatever it is that you have, and run them there in a very distributed manner and forward to, you know, a certain upstream logging service or something like that. You can do that where you don't necessarily have any aggregation happening in your infrastructure. That's pretty easy to get started with, but it does have limitations — if, for example, you don't want to allow outbound internet access from each of your nodes. The next step, like you mentioned, is that we support a second tier of Vector, running maybe on a dedicated box, where you would have a number of nodes forward to this more centralized aggregator node, and then that node would forward to whatever other sinks you have in mind. That's the second level of complexity, I would say. You do get some benefits, in that you have more power to do things like aggregations and sampling in a centralized manner — there are certain things that you can't do if you never bring the data together. And especially if you're looking to reduce cost, it's nice to have that aggregator node as a leverage point where you can bring everything together, evaluate what is most important for you to forward to different places, and do that there. And then, for people who want more reliability than a single aggregation node, at this point our recommendation is something like Kafka, which is going to give you distributed, durable storage. That is a big jump in terms of infrastructure complexity, so there's definitely room for some in-betweens there that we're working on, in terms of having a failover option. Right now, you could put a couple of aggregator nodes behind a TCP load balancer or something like that, but that's not necessarily going to be the best experience. So we're investigating our options there, to try to give people a good range of choices for how much they're willing to invest in the infrastructure, and what kind of reliability and robustness benefits they need.
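The agent/aggregator topology Luke describes can be sketched with Vector's own `vector` sink and source, which forward events between tiers over the network; the addresses and Kafka details below are made up for illustration, and the two halves would live in separate config files on the respective machines.

```toml
# --- on each application node (agent tier) ---
[sources.app_logs]
  type    = "file"
  include = ["/var/log/app/*.log"]

[sinks.to_aggregator]
  type    = "vector"                       # Vector-to-Vector forwarding
  inputs  = ["app_logs"]
  address = "aggregator.internal:9000"     # hypothetical aggregator address

# --- on the dedicated aggregator node ---
[sources.from_agents]
  type    = "vector"
  address = "0.0.0.0:9000"

[sinks.durable]
  type              = "kafka"              # durable, distributed buffer downstream
  inputs            = ["from_agents"]
  bootstrap_servers = "kafka-1:9092,kafka-2:9092"
  topic             = "observability-events"
```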
Tobias Macey
0:39:19
Another aspect of the operational characteristics of the system is being able to have visibility — particularly at the aggregator level — into what the current status is of the buffering, or any errors that are cropping up, and just the overall system capacity. And I'm curious if there's any current capability for that, or what the future plans are along those lines.
Luke Steensen
0:39:44
Um, yeah, we have a setup for our own internal metrics at this point. That is another thing that — alongside the reliability stuff that you mentioned — we're looking at very closely right now and want to work on next. The way we've set ourselves up, we have kind of an event-based system internally, where things can be emitted normally as log events, but we also have the means to essentially send them through something like a Vector pipeline, where we can do aggregations and filter and sample, and do that kind of stuff to get better insight into what's happening in the process. So we haven't gotten as far as I'd like down that road at this point, but I think we have a pretty solid foundation to do some interesting things, and it's definitely going to be a point of focus in the next few weeks.
Tobias Macey
0:40:50
So in terms of the overall roadmap, you've got a fairly detailed set of features and capabilities that you're looking to implement. And I'm wondering what your decision process was in terms of the priority ordering of those features, and how you identified what the necessary set was for a 1.0 release.
Ben Johnson
0:41:12
So initially, when we planned out the project, our roadmap was largely influenced by our past experiences — not only supporting Timber customers, but also building and running our own observability tools. And just based on the previous questions you asked, it was obvious to us that we would need to support those different types of deployment models, and so a lot of the roadmap was dictated by that. So you can see, later on in the roadmap, we want to support stream processors, so we can enable that sort of deployment topology. And, yeah, it's very much evolving, though — as we learn and collect data from customers and their use cases, we actually are going to make some changes to it. In terms of a 1.0 release, everything that you see in the roadmap on GitHub, which is represented as milestones, reflects that: a 1.0 release for us represents something that a reasonably sized company could deploy and rely on Vector for. And so, again, given our experience, a lot of that is dependent on Kafka, or some sort of more complex topology, as it relates to collecting your data and routing it downstream.
Tobias Macey
0:42:38
And then, in terms of the current state of the system, how would you characterize the overall production readiness of it, and any features that are currently missing that you think would be necessary for a medium to large scale company to be able to adopt it readily?
Ben Johnson
0:42:55
Yeah. So in terms of a 1.0 release, where we would recommend it for very stringent production use cases — I think what Luke just talked about, internal metrics: I think it's really important that we improve Vector's own internal observability and provide operators the tools necessary to monitor performance, set up alarms, and make sure that they have confidence in Vector. Internally, the stress testing is also something that would raise our confidence; we have a lot of interesting stress testing use cases that we want to run Vector through, and I think that'll expose some problems, but getting that done would definitely raise our confidence. And then I think there's just some general house cleanup that would be helpful for a 1.0 release. The initial stages of this project have been inherently a little more messy, because we are building out the foundation and moving pretty quickly with our integrations. I would like to see that settle down more when we get to 1.0, so that we have smaller, incremental releases, and we take breaking changes incredibly seriously. Vector's reliability and sort of least-surprise philosophy definitely plays into how we're releasing the software, and making sure that we aren't releasing a minor update that actually has breaking changes in it, for example. So I would say those are the main things missing before we can officially call it 1.0. Outside of that, the one other thing that we want to do is provide more education on some high-level use cases around Vector. I think right now the documentation is very good in that it dives deep into all the different components — sources, sinks, and transforms — and all the options available. But I think we're lacking in more guidance around how you deploy Vector in an AWS environment or a GCP environment. That's certainly not needed for 1.0, but I think it is one of the big missing pieces that will make Vector more of a joy to use.
Tobias Macey
0:44:55
In terms of the integrations, what are some of the ways that people can add new capabilities to the system? Does it require being compiled into the static binary, or are there other integration points where somebody can add a plugin? And then also, in terms of just use of the system, I'm curious what options there are as far as being able to test out a configuration to make sure that the data flow is what you're expecting.
Luke Steensen
0:45:21
So in terms of plugins, basically, we don't have a strong concept of that right now. All of the transforms that I've mentioned, and the sources and sinks, are written in Rust and natively compiled into the system. That has a lot of benefits, obviously, in terms of performance, and we get to make sure that everything fits in perfectly ahead of time, but obviously it's not quite as extensible as we'd like at that point. So there are a number of strategies that we've thought about for allowing more user-provided plugins — I know that's a big feature of Fluentd that people get a lot of use out of. So it is something that we'd like to support, but we want to be careful how we do it. Because we don't want to give up our core strengths, which I'd say are the performance and robustness and reliability of the system. We want to be careful how we expose those extension points, to make sure that the system as a whole maintains those properties that we find most valuable.
Ben Johnson
0:46:29
Yeah, and to echo Luke on that — we've seen that plugin ecosystems are incredibly valuable, but they can be very dangerous: they can ruin a project's reputation as it relates to reliability and performance. And we've seen that firsthand with a lot of the different interesting Fluentd setups that we've seen with our customers. They'll use off-the-shelf plugins that aren't necessarily written properly or maintained actively, and it just adds this variable to the discussion of running the collector that makes it very hard to ensure that it's meeting the reliability and performance standards that we want to meet. And so I would say that if we do introduce a plugin system, it'll be quite a bit different than I think what people are expecting — that's something that we're putting a lot of thought into. And, to go back to some of the things you said before, we've had community contributions, and they're very high quality, but those still go through a code review process that exposes quite a bit of fundamental differences and issues in the code that would have otherwise not been caught. And so it's an interesting conundrum to be in: on the one hand, we like that process, because it ensures quality; on the other, it is a blocker in certain use cases.
Luke Steensen
0:47:48
Yeah, I think our strategy there so far has basically been to allow programmability in limited places — for example, the Lua transform and the upcoming JavaScript transform. There's a surprising amount that you can do, even when you're limited to that context of a single transformation. We are interested in extending that to say, would it make sense to have a sink or a source where you could write a lot of the logic in something like Lua or JavaScript, or a language compiled to WebAssembly, and then we provide almost like a standard library of I/O functions and things like that, that we would have more control over, and could do a little bit more to ensure, like Ben said, the performance and reliability of the system. And then the final thing is we really want Vector to be as easy to contribute to as possible. Ben mentioned some housekeeping things that we want to do before we really consider it 1.0, and I think a lot of that is around extracting common patterns for things like sources, sinks, and transforms into kind of our internal library, so that if you want to come in and contribute support to Vector for a new downstream database or metrics service or something like that, we want that process to be as simple as possible. And we want you to be guided into the right path in terms of handling your errors and doing retries by default, and all of that stuff — we want it to be right there and very easy, so that we can minimize the barrier. There's always going to be a barrier if you say you have to write a pull request to get support for something, as opposed to just writing a plugin, but we want to minimize that as much as we possibly can.
Tobias Macey
0:49:38
And there are a whole bunch of other aspects of this project that we haven't touched on yet that I have gone through in the documentation and that I think are interesting. But I'm wondering if there is anything in particular that either of you would like to discuss further before we start to close out the show.
Ben Johnson
0:49:55
In terms of the actual technical implementation of Vector, I think one of the unique things worth mentioning is that Vector is intended to be the single data collector across all of your different types of data. We think that's a big gap in the industry right now — that you need different tools for metrics, and logs, and exceptions, and traces — and so we think we can really simplify that. That's one of the things that we didn't touch on very well in this podcast, but right now we support logs and metrics, and we're considering expanding support for different types of observability data, so that you can claim full ownership and control of the collection of that data and the routing of it.
Luke Steensen
0:50:36
Yeah, I mean, there are small little technical things within Vector that I think are neat and could talk about for a little while. But for me, the most interesting part of the project is viewing it through the lens of being a platform that you program — that it's as flexible and programmable as possible, kind of in the vein of those bash one-liners that I talked about. That's something that can be a lot of fun and can be very productive, and the challenge of lifting that thing that you do in the small — on your own computer, or on a single server, or something like that — up to a distributed context, I find a really interesting challenge. And there are a lot of fun little pieces that you get to put together as you try to move in that direction.
Tobias Macey
0:51:28
Well, I'm definitely going to be keeping an eye on this project. And for anybody who wants to follow along with you, or get in touch with either of you and keep track of the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Luke Steensen
0:51:49
For me, I think there are so many interesting stream processing systems, databases, tools, and things like that, but there hasn't been quite as much attention paid to the glue — like, how do you get your data in, how do you integrate these things together — and that ends up being a big barrier to getting people into these tools and getting a lot of value out of them. There's a really high activation energy, or it's kind of assumed that you're already bought into a given ecosystem or something like that. That's the biggest thing for me: for a lot of people and a lot of companies, it takes a lot of engineering effort to get to the point where you can do interesting things with these tools.
Ben Johnson
0:52:33
And as an extension of that, it doesn't stop at the collection side — it goes all the way to the analysis side as well. Our ethos is to help empower users to accomplish that, and to take ownership of their data and their observability strategy. Vector is the first project that we're launching in that initiative, but we think it goes all the way across. And so, to echo Luke, that is the biggest thing, because so many people get so frustrated with it that they just throw their hands up and hand their money over to a vendor — which is fine in a lot of use cases, but it's not empowering. And there's no silver bullet: there's no one storage or one vendor that is going to do everything amazingly. And so, at the end of the day, being able to take advantage of all the different technology and tools and combine them into a cohesive observability strategy, in a way that is flexible and able to evolve with the times, is the Holy Grail. That's what we want to enable, and we think that process needs quite a bit of improvement.
Tobias Macey
0:53:43
I appreciate the both of you taking your time today to join me and discuss the work that you're doing on Vector and Timber. It's definitely a very interesting project, and one that I hope to be able to make use of soon to facilitate some of my overall data collection efforts. So I appreciate all of your time and effort on that, and I hope you enjoy the rest of your day.
Ben Johnson
0:54:03
Thank you. Yeah, and just to add to that — if anyone listening wants to get involved or ask questions, there's a community link on the repo itself. You can chat with us. We want to be really transparent and open, and we're always welcoming conversations around the things we're doing.
Luke Steensen
0:54:22
Just want to emphasize everything Ben said. And thanks so much for having us.
Ben Johnson
0:54:26
Yeah, thank you.
Tobias Macey
0:54:32
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it — email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Customer Analytics At Scale With Segment - Episode 72

Summary

Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Segment is and how the business got started?
    • What are some of the primary ways that your customers are using the Segment platform?
    • How have the capabilities and use cases of the Segment platform changed since it was first launched?
  • Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fits into the overall structure of Segment and the driving force behind their design and use?
  • What are some of the best practices for structuring custom events in a way that they can be easily integrated with downstream platforms?
    • How do you manage changes or errors in the events generated by the various sources that you support?
  • How is the Segment platform architected and how has that architecture evolved over the past few years?
  • What are some of the unique challenges that you face as a result of being a many-to-many event routing platform?
  • In addition to the various services that you integrate with for data delivery, you also support populating data warehouses. What is involved in establishing and maintaining the schema and transformations for a customer?
  • What have been some of the most interesting, unexpected, and/or challenging lessons that you have learned while building and growing the technical and business aspects of Segment?
  • What are some of the features and improvements, both technical and business, that you have planned for the future?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

Summary

As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host's mind is blown by the possibilities of treating everything, including schema information, as a stream.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pravega is and the story behind it?
  • What are the use cases for Pravega and how does it fit into the data ecosystem?
    • How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
  • How do you represent a stream on-disk?
    • What are the benefits of using this format for persisted streams?
  • One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
  • I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
  • For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
  • What are some of the unique system design patterns that are made possible by Pravega?
  • How is Pravega architected internally?
  • What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
  • A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
    • Does it have any special capabilities for simplifying processing of out-of-order events?
  • For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
    • What are some of the operational edge cases that users should be aware of?
  • What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
  • What are some cases where you would recommend against using Pravega?
  • What is in store for the future of Pravega?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
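
As a rough illustration of the continuous query model, the sketch below creates a stream and a continuous view and then writes events into it from Python. It assumes a running PipelineDB server and uses the pre-1.0 CREATE STREAM / CREATE CONTINUOUS VIEW syntax (the 1.0 extension release moved to standard PostgreSQL DDL), so treat it as indicative rather than exact:

```python
# Hedged PipelineDB sketch: maintain per-URL latency aggregates over an
# unbounded stream of page view events. Connection settings, stream and
# view names are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Declare the stream and a continuous view that aggregates over it.
cur.execute("CREATE STREAM page_views (url text, latency_ms integer)")
cur.execute("""
    CREATE CONTINUOUS VIEW url_latency AS
    SELECT url, avg(latency_ms) AS avg_latency, count(*) AS hits
    FROM page_views
    GROUP BY url
""")

# Events inserted into the stream update the aggregate immediately and
# are not themselves persisted.
cur.execute("INSERT INTO page_views (url, latency_ms) VALUES (%s, %s)",
            ("/home", 42))

# Read the continuously maintained result like any other relation.
cur.execute("SELECT url, avg_latency, hits FROM url_latency")
print(cur.fetchall())
```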

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what PipelineDB is and the motivation for creating it?
    • What are the major use cases that it enables?
    • What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
  • What are the major concepts and components that users of PipelineDB should be familiar with?
  • Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
  • What are some of the common patterns for populating data streams?
  • What are the options for scaling PipelineDB systems, both vertically and horizontally?
    • How much elasticity does the system support in terms of changing volumes of inbound data?
    • What are some of the limitations or edge cases that users should be aware of?
  • Given that inbound data is not persisted to disk, how do you guard against data loss?
    • Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
    • Can a separate table be used as an input stream?
  • Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
  • What are some of the features that you have found to be the most useful which users might initially overlook?
  • What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
  • What are some of the most challenging aspects of building continuous aggregates on unbounded data?
  • What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
  • What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
  • When is PipelineDB the wrong choice?
  • What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
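
For readers who have never touched Spark, a minimal PySpark job (not taken from the book) looks something like the sketch below; the file name and column are hypothetical:

```python
# Minimal PySpark sketch: count events per status in a CSV file.
# Assumes a local Spark installation and an events.csv file with a
# "status" column, both of which are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("status-counts").getOrCreate()

events = spark.read.csv("events.csv", header=True, inferSchema=True)

counts = (events
          .groupBy("status")
          .agg(F.count("*").alias("n"))
          .orderBy(F.desc("n")))

counts.show()
spark.stop()
```

The same application code runs on a laptop or against a cluster, which is a large part of why the deployment and scaling questions in the interview matter.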

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Spark is?
    • What are some of the main use cases for Spark?
    • What are some of the problems that Spark is uniquely suited to address?
    • Who uses Spark?
  • What are the tools offered to Spark users?
  • How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
  • For someone building on top of Spark what are the main software design paradigms?
    • How does the design of an application change as you go from a local development environment to a production cluster?
  • Once your application is written, what is involved in deploying it to a production environment?
  • What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
  • What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
  • What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
  • What are the limitations of the Spark programming model?
    • What are the cases where Spark is the wrong choice?
  • What was your motivation for writing a book about Spark?
    • Who is the target audience?
  • What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
  • What advice do you have for anyone who is considering or currently using Spark?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

  • Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
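
As a taste of the programming model, here is a small sketch using Flink's Python DataStream API; note that the Python API arrived well after this episode was recorded (Flink's primary APIs at the time were Java and Scala), and the sensor data here is invented:

```python
# Hedged PyFlink sketch: filter and transform a stream of sensor readings.
# A bounded in-memory collection stands in for an unbounded source.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

readings = env.from_collection([
    ("sensor-1", 21.5),
    ("sensor-2", 19.0),
    ("sensor-1", 22.1),
])

# Keep only warm readings and tag them; keyed windows and stateful
# aggregations follow the same fluent style.
(readings
    .filter(lambda r: r[1] > 20.0)
    .map(lambda r: (r[0], "warm"))
    .print())

env.execute("sensor-filter")
```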

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Flink is and how the project got started?
  • What are some of the primary ways that Flink is used?
  • How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
    • What are some use cases that Flink is uniquely qualified to handle?
  • Where does Flink fit into the current data landscape?
  • How is Flink architected?
    • How has that architecture evolved?
    • Are there any aspects of the current design that you would do differently if you started over today?
  • How does scaling work in a Flink deployment?
    • What are the scaling limits?
    • What are some of the failure modes that users should be aware of?
  • How is the statefulness of a cluster managed?
    • What are the mechanisms for managing conflicts?
    • What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
    • Can state be shared across processes or tasks within a Flink cluster?
  • What are the comparative challenges of working with bounded vs unbounded streams of data?
  • How do you handle out of order events in Flink, especially as the delay for a given event increases?
  • For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
  • What are some of the most challenging or complicated aspects of building and maintaining Flink?
  • What are some of the most interesting or unexpected ways that you have seen Flink used?
  • What are some of the improvements or new features that are planned for the future of Flink?
  • What are some features or use cases that you are explicitly not planning to support?
  • For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
    • What do they find most interesting or exciting?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
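
To give a sense of the instrumentation side, the sketch below uses the Snowplow Python tracker to send a page view and a custom event to a collector; the class names follow the tracker's classic API, but exact constructor signatures vary by version and the collector host is fictional:

```python
# Hedged Snowplow sketch: emit a page view and a structured event to a
# (fictional) collector endpoint. Signatures may differ across tracker
# versions, so treat this as illustrative.
from snowplow_tracker import Emitter, Tracker

emitter = Emitter("collector.example.com")      # Snowplow collector host
tracker = Tracker(emitter, app_id="web-shop")   # app_id is a placeholder

# A standard page view event.
tracker.track_page_view("https://example.com/pricing", "Pricing")

# A custom structured event, validated and enriched downstream before
# landing in your storage targets.
tracker.track_struct_event(category="checkout", action="add-to-cart",
                           label="sku-1234", value=1)
```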

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

  • Introductions
  • How did you get involved in the area of data engineering and data management?
  • What is Snowplow Analytics and what problem were you trying to solve when you started the company?
  • What is unique about customer event data from an ingestion and processing perspective?
  • Challenges with properly matching up data between sources
  • Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
    • Cleanliness/accuracy
  • What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
  • Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
    • How has that architecture evolved from when you first started?
    • What would you do differently if you were to start over today?
  • Ensuring appropriate use of enrichment sources
  • What have been some of the biggest challenges encountered while building and evolving Snowplow?
  • What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

Summary

Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is Alooma and what is the origin story?
  • How is the Alooma platform architected?
    • I want to go into stream VS batch here
    • What are the most challenging components to scale?
  • How do you manage the underlying infrastructure to support your SLA of 5 nines?
  • What are some of the complexities introduced by processing data from multiple customers with various compliance requirements?
    • How do you sandbox users’ processing code to avoid security exploits?
  • What are some of the potential pitfalls for automatic schema management in the target database?
  • Given the large number of integrations, how do you maintain the
    • What are some challenges when creating integrations? Isn’t it simply a matter of conforming to an external API?
  • For someone getting started with Alooma what does the workflow look like?
  • What are some of the most challenging aspects of building and maintaining Alooma?
  • What are your plans for the future of Alooma?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

Summary

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Interview

Alan Anders from Applecart

  • What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
  • What are the biggest technical hurdles at Applecart?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Stepan Pushkarev from Hydrosphere.io

  • What is Hydrosphere.io?
  • What metrics do you track to determine when a machine learning model is not producing an appropriate output?
  • How do you determine which data points to sample for retraining the model?
  • How does the role of a machine learning engineer differ from data engineers and data scientists?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA