
Navigating Boundless Data Streams With The Swim Kernel - Episode 98

Summary

The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Swim.ai is and how the project and business got started?
    • Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
  • What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
  • How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
  • Can you describe a typical design for an application or system being built on top of the Swim platform?
    • What does the developer workflow look like?
      • What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
  • Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
  • For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
    • What mechanisms are in place to account for network failures?
  • Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
  • Since there is no explicit data layer, how is data redundancy handled by Swim applications?
  • What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
  • What have you found to be the most challenging aspects of building the Swim platform?
  • What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
  • What do you have planned for the future of the technical and business aspects of Swim.ai?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries, or are you just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Simon Crosby about Swim.ai, the data fabric for the distributed enterprise. So Simon, can you start by introducing yourself?
Simon Crosby
0:02:28
Hi, I'm Simon Crosby. I am the CTO, I guess, of long duration; I've been around for a long time. And it's a privilege to be with the Swim folks, who have been building this fabulous platform for streaming data for about five years.
Tobias Macey
0:02:49
And do you remember how you first got involved in the area of data management?
Simon Crosby
0:02:53
Well, I have a PhD in applied mathematics and probability, so I am kind of not a data management guy, I'm an analysis guy. I like what comes out of, you know, streams of data and what inferences you can draw from them. So my background is more on the analytical side. And then along the way, I began to see how to build big infrastructure for it.
Tobias Macey
0:03:22
And now you have taken up the position as CTO for Swim.ai. I'm wondering if you can explain a bit about what the platform is and how the overall project and business got started?
Simon Crosby
0:03:33
Sure. So here's the problem. We're all reading all the time about these wonderful things that you can do with machine learning, and streaming data, and so on, and it all involves cloud and other magical things. And in general, most organizations just don't know how to make head or tail of that, for a bunch of reasons; it's just too hard to get there. So if you're an organization with assets that are spitting out lots of data, and that could be a bunch of different types, you know, you probably don't have the skill set in house to deal with that vast amount of information. And we're talking about boundless data sources here, things that never stop. And so to deal with the data flow pipelines, to deal with the data itself, to deal with the learning and inferences you might draw from that, and so on, enterprises face a huge skill set challenge. There is also a cost challenge, because today's techniques related to drawing inference from data in general result in, you know, large, expensive data lakes, either in house or perhaps in the cloud. And then finally, there's a challenge with the timeliness within which you can draw an insight. Most folks today believe that you store data, and then you think about it in some magical way, and you draw inference from that. We're all suffering from the Hadoop and Cloudera after effects, I guess, and really, this notion of storing and then analyzing needs to be dispensed with for fast data; certainly for boundless data sources that will never stop, it's really inappropriate. So when I talk about boundless data today, we're going to talk about data streams that just never stop, and about the need to derive insights from that data on the fly, because if you don't, something will go wrong. So it's of the type that would stop your car before you hit the pedestrian in the crosswalk, that kind of stuff. For that kind of data, there's just no chance to store it down to hard disk first.
Tobias Macey
0:06:16
and how would you differentiate the work that you're doing with the Swim.ai platform and the SwimOS kernel from things that are being done with tools such as Flink, or other streaming systems such as Kafka, which now has capabilities for doing some limited streaming analysis on the data as it flows through, or also platforms such as Wallaroo that are built for doing stateful computations on data streams?
Simon Crosby
0:06:44
So first of all, there have been some major steps forward, and anything we do, we stand on the shoulders of giants. Let's start off by distinguishing between the large enterprise skill set that's out there and the cloud world. All the things you mentioned live in the cloud world. As a reference distinction, most people in the enterprise, when you say Flink, wouldn't know what the hell you're talking about. Okay, similarly Wallaroo or anything else; they just wouldn't know what you're talking about. And so there is a major problem with the tools and technologies that are built for the cloud, really for cloud native applications, versus the majority of enterprises, who are stuck with legacy IT and application skill sets and are still coming up to speed with the right thing to do. And to be honest, they're still getting over the headache of Hadoop. Then, if we talk about the cloud native world, there is a fascinating distinction between all the various projects which have started to tackle streaming data. There has been some major progress made there, Swim being one of them, and I'd be delighted to go into each one of those projects in detail as we go forward. The key point being that, first and foremost, the large majority of enterprises just don't
Tobias Macey
0:08:22
know what to do. And then within your specific offerings, there is the Data Fabric platform, which you're targeting at enterprise consumers, and then there's also the open source kernel of that in the form of SwimOS. I'm wondering if you can provide some explanation as to what the differentiating factors are between those two products, and the sort of decision points for when somebody might want to use one versus the other?
Simon Crosby
0:08:50
Yeah, let's look first at the distinction between the application layer and the infrastructure needed to run a large distributed data flow pipeline. And so for Swim, all of the application layer stuff, everything you need to build an app, is entirely open source. Some of the capabilities that you want to run a large distributed data pipeline are proprietary. And that's really just because, you know, we're building a business around this; we plan to open source more and more features over time.
Tobias Macey
0:09:29
And then as far as the primary use cases that you are enabling with the Swim platform, and some of the different ways that enterprise organizations are implementing it, what are some of the cases where using something other than Swim, either the OS or the Data Fabric layer, would be either impractical or intractable, if they were trying to use more traditional approaches such as Hadoop, as you mentioned, or a data warehouse and more batch oriented workflows?
Simon Crosby
0:09:58
So let's start off by describing what Swim does, can I do that? That might help. In our view, it's our job to build the pipeline, and indeed the model, from the data. Okay, so Swim just wants data, and from the data we will automatically build this typical data flow pipeline. And indeed, from that, we will build a model of arbitrarily interesting complexity, which allows us to solve some very interesting problems. Okay. So the Swim perspective starts with data, because that's where our customers' journey starts. They have lots and lots of data; they don't know what to do with it. And so the approach we take in Swim is to allow the data to build the model. Now, you would naturally say that's impossible in general, but what it requires is some ontology at the edge which describes the data. You could think of it as a schema, in fact, basically to describe what data items mean in some sort of useful sense to us as modelers. But then, given data, Swim will build that model. So let me give you an example. Given a relatively simple ontology for traffic and traffic equipment, so the positions of lights, the loops in the road, the lights and so on, Swim will build a model which is a stateful digital twin, if you like, for every sensor, every source of data, which runs concurrently in some distributed fabric, processes its own raw data and statefully evolves, okay. So simply given that ontology, Swim knows how to build these stateful, concurrent little things we call web agents, actually, and yeah, I'm using that term,
0:12:18
I guess the same as digital twin.
0:12:21
And these are concurrent things which are going to statefully process raw data and represent it in a meaningful way. And the cool thing about that is that each one of these little digital twins exists in a context, a real world context, that Swim is going to discover for us. So for example, an intersection might have 60 to 80 sensors, so there's this notion of containment; but also, intersections are adjacent to other intersections in the real world map, and so on. That notion of adjacency is also a real world relationship. And in Swim, this notion of a link allows us to express the real world relationships between these little digital twins. And linking in Swim has this wonderful additional property, which is to allow us to express it essentially as, in Swim, there is never a pub, but there is a sub. And if something links to something else, so if I link to you, then it's like LinkedIn for things: I get to see the real time updates of the in-memory state held by that digital twin. So digital twins link to digital twins courtesy of real world relationships, such as containment or proximity. We can even do other relationships, like correlation,
0:14:05
and those things are also linked to each other, which allows them to share data.
0:14:09
And sharing data allows interesting computational properties to be derived. For example, we can learn and predict. Okay, so job one is to define the ontology; Swim goes and builds a graph, a graph of digital twins, which is constructed entirely from the data. And then the linking happens as part of that. And that allows us to then construct interesting computations.
0:14:45
Is that useful?
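For readers who want to see the shape of this idea in code, here is a minimal sketch in plain Java of a stateful digital twin for an intersection, with links to its neighbours. The class and method names are illustrative stand-ins chosen for this example, not the actual SwimOS web agent API.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative "digital twin" of a traffic intersection; a stand-in for a
    // SwimOS web agent, not the real API. It statefully consumes its own raw
    // sensor readings and links to neighbouring intersections.
    class IntersectionTwin {
        final String uri;                              // e.g. "/intersection/42"
        private String lightState = "unknown";         // in-memory state, evolved from data
        private int vehicleCount = 0;
        private final List<IntersectionTwin> links = new ArrayList<>();

        IntersectionTwin(String uri) { this.uri = uri; }

        // Real-world containment or adjacency becomes an explicit link.
        void linkTo(IntersectionTwin other) { links.add(other); }

        // Called for every raw sensor message; the twin evolves its own state.
        void onSensorEvent(String signal, int value) {
            if (signal.equals("light")) lightState = value == 1 ? "green" : "red";
            if (signal.equals("loop"))  vehicleCount += value;
        }

        // Linked twins may inspect each other's current state at any time.
        int neighbourhoodVehicles() {
            int total = vehicleCount;
            for (IntersectionTwin n : links) total += n.vehicleCount;
            return total;
        }

        public static void main(String[] args) {
            IntersectionTwin a = new IntersectionTwin("/intersection/1");
            IntersectionTwin b = new IntersectionTwin("/intersection/2");
            a.linkTo(b);                       // adjacency discovered from the data/ontology
            a.onSensorEvent("loop", 3);
            b.onSensorEvent("loop", 5);
            System.out.println(a.neighbourhoodVehicles()); // prints 8
        }
    }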
Tobias Macey
0:14:46
Yes, that's definitely helpful to get an idea of some of the use cases and some of the ways that the different concepts within Swim work together, and to build out what a sort of conceptual architecture would be for an application that would utilize Swim.
Simon Crosby
0:15:03
So the key thing here is, I'm talking about an application. The application, as I just said, is to predict the future, the future traffic in a city, or what's going to happen in the traffic grid right now. Now, I could do that for a bunch of different cities. What I can tell you is that I need a model for each city. And there are two ways to build a model. One way is I get a data scientist and have them build it, or maybe they train it, and a whole bunch of other things, and I'm going to have to do this for every single city where I want to use this application. The other way to do it is to build the model from the data. And that's the approach. So what Swim does is, simply given the ontology, build these little digital twins, which are representatives of the real world things, get them to statefully evolve, and then link them to other things, you know, to represent real world relationships. And then suddenly, hey presto, you have built a large graph, which is effectively the model that you would otherwise have had to have a human build, right? So it's constructed in the sense that in any new city you go to, this thing is just going to unfold, and just given a stream of data, it will build a model which represents the things that are the sources of data and their physical relationships. Does that make sense?
Tobias Macey
0:16:38
Yeah, and I'm wondering if you can expand upon that, in terms of the type of workflow that a developer who is building an application on top of Swim would go through, as far as identifying what those ontologies are and defining how the links will occur as the data streams into the different nodes in the Swim graph.
Simon Crosby
0:17:01
So the key point here is that we think that we can build, like, 80% of an app, okay, from the data. That is, we can find all of the structural properties of relevance in the data, and then let the application builder drop in what they want to compute. And so let me try and express this slightly differently. Job one, we believe, is to build a model of the stateful digital twins, which almost mirror their real world counterparts. So at all points in time, their job is to represent the real world, as faithfully and as close to real time as they can, in a stateful way which is relevant to the problem at hand. Okay, so, for example, a light turned red, something like that. And the first problem is to build these stateful digital twins, which are interlinked, which represent the real world things, okay. And it's important to separate that from the application layer component of what you want to compute from that. Frequently we see people making the wrong decision, that is, hard coupling the notion of prediction, or learning, or any other form of analysis into the application in such a way that any change requires programming. And we think that that's wrong. So job one is to have this faithful representation of a real time world, in which everything evolves its own state whenever its real world twin evolves, and evolves its state appropriately. And then the second component to that, which we do on a separate timescale, is to inject operators which are going to then compute on the states of those things at the edge, right. So we have a model which represents the relationships between things in the real world. It's attempting to evolve as close as possible to real time in relationship to its real world twin, and it's reflecting its links and so on. But the notion of what you want to compute from it is separate from that and decoupled. And so the second step, which is building an application right here, right now, is to drop in an operator which is going to compute a thing from that. So you might say, cool, I want every intersection to be able to learn from its own behavior and predict. That's one thing. Or we might say, I want to compute the average wait time of every car in the city. That's another thing. So the key point here is that computing from these rapidly evolving worldviews is decoupled from the actual model of what's going on in that world at any point in time. So Swim reflects that decoupling by allowing you to bind operators to the model whenever you want.
0:20:45
Okay,
0:20:46
By whenever you want, I mean you can write them in code, in bits of Java or whatever. But also, you can write them in blobs of JavaScript or Python, and dynamically insert them into a running model. Okay, so let me make that one concrete for you. I could have a deployed system, which is a model, a deployed graph of digital twins, which are currently mirroring the state of Las Vegas. And then dynamically, a data scientist says, let me compute the average wait time of red cars at these intersections, and drops that in as a blob of JavaScript attached to every digital twin for an intersection. That is what I mean by an application. And so we want to get to this point where the notion of an application is not something deeply hidden in somebody's, you know, Jupyter notebook, or in some programmer's brain before they quit and wander off to the next startup ten months later; an application is whatever anyone, right now, drops into a running model.
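A rough sketch of the decoupling Simon describes: the twin keeps evolving from raw events, and an operator such as "average wait time of red cars" is bound to the live model after the fact. The interface and names here are hypothetical, chosen for the example, not the SwimOS API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Illustrative decoupling of model and computation: operators are attached to a
    // running twin after the fact, the way Simon describes dropping in a blob of
    // JavaScript or Python. Names are hypothetical, not the SwimOS API.
    class CarEvent {
        final String colour; final double waitSeconds;
        CarEvent(String colour, double waitSeconds) { this.colour = colour; this.waitSeconds = waitSeconds; }
    }

    class IntersectionAgent {
        private final List<Consumer<CarEvent>> operators = new ArrayList<>();

        // An operator can be bound to the live model at any time.
        void attachOperator(Consumer<CarEvent> op) { operators.add(op); }

        // Every raw event is offered to the attached operators as it arrives.
        void onCar(CarEvent e) { operators.forEach(op -> op.accept(e)); }
    }

    class RedCarWaitTime implements Consumer<CarEvent> {
        private double sum = 0; private long count = 0;
        @Override public void accept(CarEvent e) {
            if (!e.colour.equals("red")) return;
            sum += e.waitSeconds; count++;
            System.out.printf("current avg red-car wait: %.1fs%n", sum / count);
        }
    }

    class Demo {
        public static void main(String[] args) {
            IntersectionAgent agent = new IntersectionAgent();    // already running
            agent.attachOperator(new RedCarWaitTime());           // "dropped in" later
            agent.onCar(new CarEvent("red", 12.0));
            agent.onCar(new CarEvent("blue", 4.0));
            agent.onCar(new CarEvent("red", 20.0));               // current average is now 16.0s
        }
    }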
Tobias Macey
0:22:02
So the way that sounds to me is that Swim essentially acts as the infrastructure layer you deploy to ingest the data feeds from the sets of sensors, and then it will automatically create these digital twin objects to have some digital manifestation of the real world, so that you have a continuous stream of data and how it's interrelated. And then it sort of flips the order of operations in terms of how the data engineer and the data scientist might work together. In the way that most people are used to, you ingest the data from these different sensors, bundle it up, and then hand it off to a data scientist to do their analyses; they generate a model and then hand it back to the data engineer to say, okay, go ahead and deploy this and then see what the outputs are. Whereas instead, the Swim platform essentially acts as the delivery mechanism and the interactive environment for the data scientists to experiment with the data, build a model, and then get it deployed on top of the continuously updating live stream of data, and then be able to have some real world interaction with those sensors in real time as they're doing that, to be able to feed that back and say, okay, red cars are waiting 15% longer than other cars at these two intersections, and I want to optimize our overall grid, and that will then feed back into the rest of the network to have some physical manifestation of the analysis that they're trying to perform, to maybe optimize overall traffic.
Simon Crosby
0:23:39
So there are some consequences to that. First of all, every algorithm has to compute stuff on the fly. So if you look at, you know, the kind of store-and-then-analyze approach to big data type learning, or training, or anything else, there you have all the data at hand; here, you don't. And so every algorithm that is part of Swim is coded in such a way as to continually process data. And that's fundamentally different to most frameworks. Okay, so for example,
0:24:19
the
0:24:21
learn and predict cycle is what, you know, you mentioned training, and so on. And that's very interesting. But the notion of train-then-apply implies that I collect and store some training data, and that it's complete and useful enough to train the model and hand it back. You know, what if it isn't? And so, in Swim, we don't do that. I mean, we can if you want; if you have a model, it's no problem for us to use that too. But instead, in Swim, the input vector, say to a prediction algorithm, is precisely the current state of the digital twins for some bunch of things, right? Maybe the set of sensors in the neighborhood of an urban intersection. And so this is a continuously varying, real world triggered scenario in which real data is fed through the algorithm but is not stored anywhere. So everything is fundamentally streaming. We assume that data streams continually, and indeed the output of every algorithm streams continually. So what you see when you compute an average is the current average. Okay, when you're looking for heavy hitters, what you see is the current heavy hitters. All right. And so every algorithm has its streaming twin, I guess. And part of the art in the Swim context is reformulating the notion of analysis into a streaming context, so that you never expect a complete answer, because there isn't one; it's just what I've seen until now. Okay, and what I've seen until now has been fed through the algorithm, and this is the current answer. And so every algorithm computes and streams. And the notion of linking, which I described earlier for Swim between digital twins, say, applies also to these operators, which effectively link to the things they want to compute from, and then they stream their results. Okay, so if you link in, you see a continuous update. And, for example, that stream could be used to feed a Kafka implementation, which would serve a bunch of applications; you know, the notion of streaming is pretty well understood, so we can feed other bits of the infrastructure very well. But fundamentally, everything is designed to stream.
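The streaming reformulation he mentions can be made concrete with two tiny examples: a running mean and an approximate heavy-hitters sketch, each of which always has a "current answer" and never stores the raw stream. This uses the standard Misra-Gries technique as an illustration; it is not a claim about Swim's internal algorithms.

    import java.util.HashMap;
    import java.util.Map;

    // Streaming reformulations of two classic analyses: the "answer" is always the
    // current answer over what has been seen so far. The heavy-hitter part follows
    // the standard Misra-Gries sketch; it illustrates the idea, not Swim's code.
    class StreamingStats {
        private double mean = 0; private long n = 0;
        private final Map<String, Long> counters = new HashMap<>();
        private final int k;                       // track at most k-1 heavy-hitter candidates

        StreamingStats(int k) { this.k = k; }

        void observe(String key, double value) {
            // Running mean: update in O(1), never store the raw stream.
            n++; mean += (value - mean) / n;

            // Misra-Gries update for approximate heavy hitters.
            if (counters.containsKey(key)) counters.merge(key, 1L, Long::sum);
            else if (counters.size() < k - 1) counters.put(key, 1L);
            else counters.replaceAll((item, c) -> c - 1);
            counters.values().removeIf(c -> c == 0);
        }

        double currentMean() { return mean; }                          // the current average
        Map<String, Long> currentHeavyHitters() { return counters; }   // the current candidates

        public static void main(String[] args) {
            StreamingStats s = new StreamingStats(4);
            String[] keys = {"a", "b", "a", "c", "a", "d", "a", "b"};
            double[] vals = {1, 2, 3, 4, 5, 6, 7, 8};
            for (int i = 0; i < keys.length; i++) s.observe(keys[i], vals[i]);
            System.out.println(s.currentMean());          // 4.5 so far
            System.out.println(s.currentHeavyHitters());  // "a" dominates the candidate set
        }
    }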
Tobias Macey
0:27:21
it's definitely an interesting approach to the overall workflow of how these analyses work. And one thing that I'm curious about is how data scientists and analysts have found working with this platform, compared to the ways that they might be used to working; I'm interested in how data scientists view this.
Simon Crosby
0:27:45
To be honest, in general, with surprise.
0:27:50
Our experience to date has been largely with people who don't know what the heck they're doing in terms of data science. So they're trying to run an oil rig more efficiently; they have, what, about 10,000 sensors, and they want to make sure this thing isn't going to blow up. Okay? So these tend to be heavily operationally focused folks. They're not data scientists, they never could afford one, and they don't understand the language of data science, or have the ability to build cloud based pipelines that you and I might be familiar with. So these are folks who effectively just want to do a better job, given this enormous stream of data they have. They believe they have something in the data, they don't know what that might be, but they're keen to go and see. Okay. And so those are the folks we spend most of our time with. I'll give you a funny example, if you'd like.
Tobias Macey
0:29:00
Sure, that sounds illustrative.
Simon Crosby
0:29:02
We work with a manufacturer of aircraft.
0:29:05
And they have a very large number of RFID tagged parts, and equipment too. And if you know anything about RFID, you know it's pretty useless stuff, built about 10 or 20 years ago. So what they were doing is, from about 2,000 readers, they get about 10,000 reads a second, and each one of these reads is simply written into an Oracle database, and at the end of the day they try to reconcile the whole thing with whatever parts they have, and wherever each thing is, and so on. The Swim solution to this is entirely different, and it gives you a good idea of why we care about modeling data, or thinking about data, differently. We simply build a digital twin for every tag; the first time it's seen, we create one, and if they haven't been seen for a long time, they just expire. And whenever a reader sees a tag, it simply says, hey, I saw you, and this was the signal strength. Now, because tags get seen by multiple readers, each digital twin of a tag does the obvious thing: it triangulates from the readers. Okay, so it learns the attenuation in different parts of the plant. It's very simple; initially, the word learn there is rather a stretch, it's a pretty straightforward calculation, and then suddenly it can work out where it is in 3-space. So instead of an Oracle database full of tag reads, and lots and lots of post processing, you know, you have a couple of Raspberry Pis, and each one of these Raspberry Pis has millions of these tags running, and then you can ask any one of them where it is. Okay, and then you can do even more, you can say, hey, show me all the things within three meters of this tag, okay? And that allows you to see components being put together into real physical objects, right? So as a fuselage gets built up, or the engine, or whatever it is. And so a problem which was tons of infrastructure and tons of tag reads got turned into a couple of Raspberry Pis running stuff which kind of self-organized into a form which could feed real time visualization and control around where bits of infrastructure were.
0:31:52
Okay. Now, that
0:31:54
was transformative for this outfit, which quite literally had no practical way of tackling the problem before. Does that make sense?
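As a rough illustration of the "straightforward calculation" in that story, here is a toy tag twin that collects reader sightings and estimates its own position as a signal-strength-weighted centroid. The attenuation model actually used in that deployment is not described in the conversation, so this only conveys the flavour of the approach; all names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative tag twin: readers report "I saw you, at this signal strength",
    // and the twin estimates its own position from the readers that saw it.
    // A weighted centroid stands in for whatever attenuation model was really used.
    class TagTwin {
        static class Sighting {
            final double x, y, z, rssi;
            Sighting(double x, double y, double z, double rssi) { this.x = x; this.y = y; this.z = z; this.rssi = rssi; }
        }

        private final List<Sighting> sightings = new ArrayList<>();

        // Called whenever any reader sees this tag.
        void onRead(double readerX, double readerY, double readerZ, double rssi) {
            sightings.add(new Sighting(readerX, readerY, readerZ, rssi));
            if (sightings.size() > 16) sightings.remove(0);   // keep only recent sightings
        }

        // Estimate position in 3-space: stronger signal means the reader is closer, so it gets more weight.
        double[] estimatePosition() {
            double wx = 0, wy = 0, wz = 0, w = 0;
            for (Sighting s : sightings) {
                double weight = Math.pow(10, s.rssi / 10.0);   // convert dBm to a linear weight
                wx += weight * s.x; wy += weight * s.y; wz += weight * s.z; w += weight;
            }
            return w == 0 ? null : new double[] { wx / w, wy / w, wz / w };
        }

        public static void main(String[] args) {
            TagTwin tag = new TagTwin();
            tag.onRead(0, 0, 0, -40);    // strong reading from a reader at the origin
            tag.onRead(10, 0, 0, -70);   // weak reading from a distant reader
            double[] p = tag.estimatePosition();
            System.out.printf("~(%.2f, %.2f, %.2f)%n", p[0], p[1], p[2]);  // lands close to the origin
        }
    }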
Tobias Macey
0:32:02
Yeah, that's definitely a very useful example of how this technology can flip the overall order of operations and the overall capabilities of an organization to answer useful questions. And the idea of going from, as you said, an Oracle database full of rows and rows of records of "this tag read at this point in this location," to being able to actually get something meaningful out of it, as in "this part is in this location in the overall reference space of the warehouse," is definitely transformative, and probably gave them weeks or months worth of additional lead time for being able to predict problems or identify areas for potential optimization.
Simon Crosby
0:32:47
Yeah, I think we saved them $2 million a year. Let me tell you, from this tale come two interesting things. First of all, if you show up at a customer site with a service running on a Raspberry Pi, you can charge them a million bucks; okay, that's lesson one. Lesson two is that the volume of the data is not related to the value of the insight. Okay. I mentioned traffic earlier. In the city of Las Vegas, we get about 15 or 16 terabytes per day from the traffic infrastructure, and every digital twin of an intersection in the city predicts two minutes into the future, okay. And those insights are sold through an API in Azure to customers like Audi and Uber and Lyft and whoever else, okay. Now, that's a ton of data; you couldn't even think of where to put it in your cloud. But the value of the insight is relatively low. That is, the total amount of money you can extract from Uber per month per intersection is low. Alright. By the way, all this stuff is open source; you can go grab it and play, and hopefully make your city better. It's not a high enough value for me to do anything other than say, go grab it and run. So: vast amounts of data, and a relatively important, but not commercially significant, value.
Tobias Macey
0:34:35
And another aspect of that case in particular is that, despite this volume of data, it might be interesting for doing historical analyses, but in terms of actual real world utility it has a distinct expiration period, where you have no real interest in the sensor data as it existed an hour ago, because that has no particular relevance to your current state of the world and what you're trying to do with it at this point in time.
Simon Crosby
0:35:03
Yeah, you have historical interest in the sense of wanting to know if your predictions were right, or wanting to know for traffic engineering purposes, which runs on a slower time scale. So some form of bucketing, or some sampled, coarser-grained recording, is useful; and sure, that's easy. But you certainly do not want to record it at the raw data rate.
Tobias Macey
0:35:30
And then going back to the other question I had earlier, when we were talking about the workflow of an analyst or a data scientist pushing out their analyses live to these digital twins and potentially having some real world impact: I'm curious if the Swim platform has some concept of a dry run mode, where you can deploy this analysis and see what its output would be without it actually manifesting in the real world, for cases where you want to ensure that you're not accidentally introducing error or potentially having a dangerous outcome, particularly in the case that you were mentioning of an oil and gas rig.
Simon Crosby
0:36:12
Yeah, so I'm with you 100 percent. Everything we've done thus far has been open loop, in the sense that we're informing another human or another application, but we're not directly controlling the infrastructure. And the value of a dry run would be enormous, you can imagine, in those scenarios, but thus far we don't have any use cases that we can report of using Swim for direct control. We do have use cases where, on a second by second basis, we are predicting whether machines are going to make an error as they build PCBs for servers, and so on. But again, what you're doing there is calling for somebody to come over and fix the machine; you're not, you know, trying to change the way the machine behaves.
Tobias Macey
0:37:06
And now, digging a bit deeper into the actual implementation of Swim, I'm wondering if you can talk through how the system itself is architected, and some of the ways that it has evolved as you have worked with different partners to deploy it into real world environments and get feedback from them, and how that has affected the overall direction of the product roadmap.
Simon Crosby
0:37:29
So Swim is a couple of megabytes of Java extensions, okay? So it's extremely lean; we tend to deploy in containers using the GraalVM. It's very small; we can run in, you know, probably 100 megabytes or so. And when people tend to think of edge, they tend to think of running on gateways or devices; we don't really think of edge that way. An important part of defining edge, as far as we're concerned, is simply gaining access to streaming data. We don't really care where it is, but we need to be small enough to get onto limited amounts of compute towards the physical edge. And the, you know, the product has evolved in the sense that originally it was a way of building applications for the edge, and you'd sit down and write them in Java, and so on.
0:38:34
Latterly, this ability to simply let
0:38:39
the data build the app, or most of the app, came about in response
0:38:46
to customer needs.
0:38:49
But Swim is deployed typically in containers, and for that we have, in the current release, relied very heavily on the Azure IoT Edge framework. And that is magical, to be quite honest, because we can rely on Microsoft machinery to deal with all of the painful bits of deployment and lifecycle management for the code base and the application as it runs. These are not things we are really focused on; what we're trying to do is build a capability which will respond to data and do the right thing for the application developer. So we are fully published in the Azure IoT Hub, and you can download this, get going, and manage it through its lifecycle that way. And in several use cases now, we are being used to feed fast-timescale insights at the physical edge; we are labeling data and then dropping it into Azure ADLS Gen2, and feeding insights into applications built in Power BI. Okay, so, just for the sake of the machinery, we're using the Azure framework for management of the IoT edge. By the way, I think IoT Edge is about the worst possible name you could ever pick, because all you want is a thing to manage the lifecycle of a capability which is going to deal with fast data; whether it's at the physical edge or not is immaterial. But that's basically what we've been doing: relying on Microsoft's fabulous lifecycle management framework for that, plugged into the IoT Hub and all the Azure IoT and Azure services generally, for back end things which enterprises love.
Tobias Macey
0:41:00
Then another element of what we're discussing in the use case examples that you were describing, particularly for instance with the traffic intersections, is the idea of discoverability and routing between these digital twins: how they identify which twins are useful to communicate with and establish those links, and also, at the networking layer, how they handle network failures in terms of communication and ensure that if there is some sort of fault they're able to recover from it.
Simon Crosby
0:41:36
So let's talk about two layers. One is the app layer, and the other one is the infrastructure, which is going to run this effectively distributed graph.
0:41:45
And so Swim is going to build this graph for us
0:41:49
from the data. What that means is that the digital twins, which by the way we technically call web agents, these little web agents are going to be distributed over a fabric of physical instances, and they may be widely geographically
0:42:06
distributed. And
0:42:08
so there is a need, nonetheless, at the application layer, for things which are related in some way, linked physically or, you know, in some other way, to be able to link to each other; that is to say, to
0:42:23
have a sub. And so links
0:42:27
require that objects, which are the digital twins, have the ability to inspect
0:42:33
each other's data,
0:42:34
right, their members. And of course, if something is running on the other side of the planet and you're linked to it, how on earth is that going to work? So we're all familiar with object oriented languages and objects in one address space; that's pretty easy. We know what an
0:42:50
object handle or an object
0:42:51
reference or a pointer or whatever is; we get it. But when these things are distributed, that's hard. And so in Swim, if you're an application programmer, you will simply use object references, but these resolve to URIs. So in practice, at runtime, the linking is: when I link to you, I link to your URI. And that link,
0:43:17
which is resolved by Swim,
0:43:19
enables a continuous stream of updates to flow from you to me. And if we happen to be on different instances, that is, running in different address spaces, then that will go over a mesh of direct WebSocket connections between your instance and mine. And so in any Swim deployment, all instances are interlinked; they each link to each other using a single WebSocket connection, and these links permit the flow of information between linked digital twins. And what happens is, whenever a change in the in-memory state of a linked digital twin occurs, its instance then streams to every other linked object an update to the state for that thing, right. So what's required is, in effect, a streaming update to JSON; because we record our model in some form of JSON-like state, we need to be able to update little bits of it as things change, and we use a protocol called WARP for that. That's a Swim capability which we've open sourced. And what that really does is bring streaming to JSON, right, streaming updates to parts of a JSON model. And then every instance in Swim maintains its own view of the whole model. So as things stream in, the local view of the model changes. But the view of the world is very much one of a consistency model based on whatever happens to be executing locally and whatever it needs to view: it's an eventually consistent data model, in which every node eventually learns the entire thing. And generally, eventually here means, you know, a link away from real time; so it's a link's delay away from real time.
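A toy, in-process illustration of the linking idea: a link resolves to a node URI, and every change to that node's in-memory state is streamed to linked observers as a small delta, which is the spirit of streaming updates to parts of a JSON-like model. This mimics the behaviour described above, not the actual WARP wire format or the SwimOS API.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiConsumer;

    // Toy illustration of linking: a link resolves to a URI, and every change to
    // the node's in-memory state is streamed to linked observers as a small delta.
    class StreamingModel {
        private final Map<String, Map<String, Object>> nodes = new HashMap<>();
        private final Map<String, List<BiConsumer<String, Object>>> links = new HashMap<>();

        // Subscribe ("link") to a node URI; you then see every subsequent update.
        void link(String nodeUri, BiConsumer<String, Object> observer) {
            links.computeIfAbsent(nodeUri, k -> new ArrayList<>()).add(observer);
        }

        // Update one field of a node's state and stream only that delta to linked observers.
        void update(String nodeUri, String key, Object value) {
            nodes.computeIfAbsent(nodeUri, k -> new HashMap<>()).put(key, value);
            for (BiConsumer<String, Object> obs : links.getOrDefault(nodeUri, List.of())) {
                obs.accept(key, value);     // each observer's local view converges over time
            }
        }

        public static void main(String[] args) {
            StreamingModel fabric = new StreamingModel();
            fabric.link("/intersection/1", (key, value) ->
                    System.out.println("delta from /intersection/1: " + key + " = " + value));
            fabric.update("/intersection/1", "light", "red");
            fabric.update("/intersection/1", "queueLength", 7);
        }
    }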
Tobias Macey
0:45:22
And then the other aspect of the platform is the statefulness of the computation, and as you're saying, that state is eventually consistent, dependent on the communication delay between the different nodes within the context graph. And then in terms of data durability, one thing I'm curious about is the length of state, or sort of the overall buffer, that is available; I'm guessing that is largely dependent on where it happens to be deployed and what the physical capabilities of the particular node are. And then also, as far as persisting that data for maybe historical analysis, my guess is that that relies on distributing the data to some other system for long term storage. I'm just wondering what the overall pattern or paradigm is for people who want to have that capability?
Simon Crosby
0:46:24
Oh, this is a great question. So in general, the move from some horrific raw data format on the wire, coming from the original physical thing, to, you know, something much more efficient and meaningful in memory, and generally much more concise, means we get a whole ton of data reduction on the way in. And the system is focused on streaming; we don't stop you storing your original data if you want to, you can dump it to disk or wherever, but the key thing in Swim is that we don't do that on the hot path. Okay, so things change their state in memory, and maybe compute on that, and that's what they do first and foremost; then we lazily throw things to disk, because disk happens slowly relative to compute. And so typically, what we end up storing is the semantic state of the context graph, as you put it, not the original data.
0:47:23
That is, for example, in traffic world,
0:47:26
you know, we store things like "this light turned red at this particular time," not the voltage on all the registers in the light, and so we get massive data reduction. And that form of data is very amenable to storage in the cloud, say, or somewhere else, and it's even affordable at reasonable rates.
0:47:50
So the key thing for Swim and storage is,
0:47:53
you're going to remember as much as you want, as much as you have space for, locally. And then storage in general is not on the hot path; it's not on the compute-and-stream path, and in general we're getting huge data reductions for every step up the graph we make. So for example, if I go from, you know, all the states of all the traffic sensors to predictions, then I've made a very substantial reduction in the data we need to store anyway, right. So as you move up this computational graph, you reduce the amount of data you're going to have to store, and it's really up to you to pick what you want to keep.
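A small sketch of "store the semantic state, not the raw data": raw samples evolve in-memory state on the hot path, and only state transitions such as "the light turned red at time T" are queued for lazy persistence off the hot path. The class below is illustrative only, not Swim code.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch of storing semantic state instead of raw data: raw voltage samples
    // evolve in-memory state on the hot path; only state transitions such as
    // "light turned red at time T" are queued for lazy persistence off the hot path.
    class LightTwin {
        private String state = "unknown";
        private final Deque<String> pendingWrites = new ArrayDeque<>();

        // Hot path: called for every raw sample, never touches disk.
        void onRawSample(double voltage, long timestampMillis) {
            String next = voltage > 2.5 ? "red" : "green";      // toy decoding of the sample
            if (!next.equals(state)) {
                state = next;
                pendingWrites.add(timestampMillis + ": light turned " + next);
            }
        }

        // Cold path: a background task drains the much smaller semantic log to storage.
        void flush() {
            while (!pendingWrites.isEmpty()) System.out.println("persist -> " + pendingWrites.poll());
        }

        public static void main(String[] args) {
            LightTwin twin = new LightTwin();
            for (int i = 0; i < 1000; i++) twin.onRawSample(3.1, i);  // 1000 raw samples...
            twin.onRawSample(0.2, 1000);
            twin.flush();                                             // ...but only 2 semantic events stored
        }
    }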
Tobias Macey
0:48:39
So, in terms of your overall experience working as the CTO of this organization and shepherding the product direction and the capabilities of this system, I'm wondering what you have found to be some of the most challenging aspects, both from the technical and business sides, and some of the most useful or interesting or unexpected lessons that you've learned in the process.
Simon Crosby
0:49:03
So what's hard is that the real world is not the cloud native world. We've all seen fabulous examples of Netflix and Amazon and everybody else doing cool things with the data they have. But, you know, if you're an oil company and you have a rig at sea, you just don't know how to do this. So, you know, we can come at this with whatever skill sets we have; what we find is that the real world large enterprises of today are still ages behind the cloud native folks. And that's a challenge. Okay, so getting to understand what they need, because they still have lots of assets which are generating tons of data, is very hard. Second, this notion of edge is continually confusing. I mentioned previously that I would never have chosen IoT Edge, for example, as that Azure name, because it's not about IoT, or maybe it is. Let me give you two examples. One is traffic lights, say, physical things; it's pretty obvious what the notion of edge is there, it's the physical edge. But the other one is this: we build a real time model, in memory, for tens of millions of handsets for a large mobile carrier, and it evolves all the time, right, in response to continually received signals from these devices.
0:50:38
There is no edge;
0:50:40
that is, it's data that arrives over the internet, and we have to figure out where the digital twin for that thing is, and evolve it in real time. Okay, and there, you know, there is no concept of a network edge or a physical edge that data is traveling over. We just have to make decisions on the fly and learn and update the model.
0:51:06
So for me, edge is the following thing: edge is stateful.
0:51:13
And
0:51:15
cloud is all about REST. Okay, so what I'd say is, the fundamental difference between the notion of edge and the notion of cloud that I would like to see broadly understood is that, whereas REST and databases made the cloud very successful, in order to be successful with this boundless streaming data, statefulness is fundamental, which means REST goes out the door, and we have to move to a model which is streaming based, with stateful computation.
Tobias Macey
0:51:50
And then in terms of the future direction, both from the technical and business perspectives, I'm wondering what you have planned for both the enterprise product for Swim.ai, as well as the open source kernel in the form of SwimOS.
Simon Crosby
0:52:06
From an open source perspective, we,
0:52:08
you know, we don't have the advantage of having come up at LinkedIn or somewhere like that, where we could have built it in-house at scale before coming out as a startup. But we think what we've built is something of phenomenal value, and we're seeing that grow. Our intention is to continually feed the community as much as it can take, and we're just getting more and more stuff ready for open sourcing.
0:52:36
So we want to see our community
0:52:40
go and explore new use cases for this stuff, and we are totally dedicated to empowering our community. From a commercial perspective, we are focused on a world which is edge, and the moment you say edge, people tend to get an idea of the physical edge or something in their heads, and then, you know, very quickly you can get put in a bucket of IoT. I gave an example of, say, building a model in real time in AWS for, you know, a mobile customer. Our intention is to continue to push the bounds of what edge means, and to enable people to build stream pipelines for massive amounts of data easily, without complexity, and without the skill set required to invest in traditionally fairly heavyweight pipeline components such as Beam and Flink and so on,
0:53:46
to
0:53:47
enable people to get insights cheaply, and to make the problem of dealing
0:53:51
with new insights from data very easy to solve.
Tobias Macey
0:53:56
And are there any other aspects of your work on Swim, or the space of streaming data and digital twins, that we didn't discuss yet that you'd like to cover before we close out the show?
Simon Crosby
0:54:08
I think we've done a pretty good job. You know, I think there are a bunch of parallel efforts, and that's all goodness. One of the hardest things has been to get this notion of statefulness more broadly accepted. And I see the function-as-a-service vendors out there pushing their idea of stateful functions as a service, and really, these are stateful actors. And there are others out there too. So for me, step number one is to get people to realize that if we're going to deal with this data, REST and databases are going to kill us, okay? That is, there is so much data, and the rates are so high, that you simply cannot afford to use a stateless paradigm for processing; you have to do it statefully. Because, you know, forgetting context every time and then looking it up again is just too expensive.
Tobias Macey
0:55:08
For anybody who wants to follow along with you, get in touch, and keep track of what you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Simon Crosby
0:55:26
Well, I think, I mean, there isn't much tooling, to be perfectly honest. There are a bunch of really fabulous open source code bases and experts in their use, but that's far from tooling. And then there is, I guess, an extension of Power BI downwards, which is something like the monster Excel spreadsheet world, right? So you find all these folks who are pushing that kind of, you know, end user model of data, doing great things, but leaving a huge gap between the consumer of the insight and the data itself; it assumes the data is already there in some good form and can be put into a spreadsheet or view or whatever it happens to be. So there's this huge gap in the middle, which is: how do we build the model? What does the model tell us, just off the bat? How do we do this, reconstructively, in large numbers of situations? And then how do we dynamically insert operators which are going to compute useful things for us on the fly into running models?
Tobias Macey
0:56:44
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on the Swim platform. It's definitely a very interesting approach to data management and analytics, and I look forward to seeing the direction that you take it in the future. So I appreciate your time on that, and I hope you enjoy the rest of your day.
Simon Crosby
0:57:01
Thanks very much. You've been great.
Tobias Macey
0:57:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it; email [email protected] with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Building A Community For Data Professionals at Data Council - Episode 96

Summary

Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was your original reason for focusing your efforts on fostering a community of data engineers?
    • What was the state of recognition in the industry for that role at the time that you began your efforts?
  • The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?
    • How has the community itself changed and grown over the past few years?
  • Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?
  • Where do you draw inspiration and direction for how to manage such a large and distributed community?
    • What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?
  • What are some ways that you have been surprised or delighted in your interactions with the data community?
  • How do you approach sustainability of the Data Council community and the organization itself?
  • The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?
  • In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?
  • You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?
    • What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?
  • What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?
    • What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?
  • What are the characteristics of a data business that you look at when evaluating a potential investment?
  • What are some of the current industry trends that you are most excited by?
    • What are some that you find concerning?
  • What are your goals and plans for the future of Data Council?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And listen, I'm sure you work for a data driven company. Who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries, or are you just afraid that Amazon Redshift is going to fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the missing Amazon Redshift console. It's an amazing analytics product for data engineers to find and rewrite slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies. And full disclosure that Data Council and Clubhouse are both previous sponsors of the podcast, and Clubhouse is one of the companies that Pete has invested in. So Pete, can you just start by introducing yourself?
Pete Soderling
0:02:44
Yeah, thanks. Thanks for the opportunity to be here, Tobias. I'm Pete Soderling, as you mentioned, and I'm a serial entrepreneur. I'm also a software engineer from the first internet bubble. And I'm super passionate about making the world a better place for other developers.
Tobias Macey
0:02:59
And do you remember how you first got involved in the area of data management?
Pete Soderling
0:03:02
Yeah, I think, funnily enough, the thing that jumps out at me is how excited I was when, as an early developer, very young in my career, I discovered this book, Database Design for Mere Mortals. And I think I read it over my holiday vacation one year, and I was sort of amazed at myself at how interesting such a potentially dry topic could be. So that was an early indicator. Then fast forward: my first company, actually my second company, which I started in 2009, was a company called Strata Security. And originally we were building what we thought was an API firewall for web based APIs. But it quickly morphed into a platform that secured and offered premium data from providers like Bloomberg or Garmin, or companies that had lots of interesting proprietary data sets. And our vision was to become essentially the electricity in between that data provider and their API and the consumers who were consuming that data through the API, so that we could offer basically metered billing based on how much data was consumed. So I guess that was my first significant interest as an entrepreneur in the data space, back about 10 years or so ago.
Tobias Macey
0:04:20
And now you have become an investor in data oriented companies. You've also started the series of conferences that were previously known as the Data Eng Conf and have since been rebranded as Data Council. And that all started with your work in founding Hakka Labs as a community space for people working in the data engineering area. And I'm curious what your original reason was for focusing your efforts in that direction, and focusing particularly on data engineers?
Pete Soderling
0:04:51
Yeah, I guess it gets to the core a bit of who I am. And as I've looked back over my shoulder, as both an engineer and a founder, I guess what I've realized, which actually to some extent surprised me, is that all of the companies I've started have had software engineers as end users or customers. And I discovered that I really am significantly passionate about getting developers together, helping them share knowledge, and helping them with better tooling, and essentially just making the world awesome for them. And it's become a core mission of everything I do, and I think it basically is infused in all these different opportunities that I'm pursuing now. For instance, one of my goals is to help 1000 engineers start companies, but that gets into some of the startup stuff, which is essentially a side project that we can talk about later. But specifically as it relates to Hakka Labs: Hakka Labs was originally started in 2010 to become a community for software engineers. And originally we thought that we were solving the engineer recruiting problem. So we had various ideas and products that we tested around introducing engineers to companies in a trusted way. And originally that was largely for all software engineers everywhere. And our plan was to turn it into a digital platform, so it was going to have social network dynamics where we were connecting engineers together, and those engineers would help each other find the best jobs. So that very much was sort of in the social networking world. But one of our marketing ideas was that we wanted to build a series of events surrounding the digital platform, so that we could essentially lead users from the community and our events, and introduce them to the digital platform, which was the main goal. And one of the niche areas that we wanted to focus on was data, because data science was becoming a hot thing and data engineering was even more nascent. But I sort of saw the writing on the wall and saw that data engineering was going to be required to support all the interesting data science goodness that was being talked about, and really was of interest to business. And so, you know, the pure data meetups that we started were essentially a marketing channel for this other startup that I was working on. And ultimately that startup didn't work, and the product didn't succeed, which is often the case with network based products, especially in the digital world. But I realized that we had built this brand surrounding awesome content for software engineers and data engineers through the meetups that we had started. And we fell back on that community, and figured out that there must be another way to monetize and to keep the business going, because I was so passionate about working with engineers and continuing to build this community that we had seeded. And engineers loved the brand. They loved the events that we were running, and they loved our commitment to deeply technical content. And so one thing led to another, and ultimately those meetups grew into what Data Council is today.
Tobias Macey
0:07:48
And you mentioned that when you first began on this journey, it was in the 2010 timeframe, and as you referred to, data engineering as a discipline was not very widely recognized. And I'm wondering how you have seen the overall evolution of that role and responsibility, and what the state of the industry was, and what the types of recognition and responsibilities were for data engineers at that time.
Pete Soderling
0:08:16
Yeah, you know, data engineering was just not really a thing at the time, and only the largest tech companies, Google and Facebook, even had the notion of sort of the data engineering concept. But I guess, you know, what I've learned from watching engineering at those companies is that, because of their scale, they discover things more quickly and more often, or earlier, than other folks tend to. And so I think that was just an interesting leading indicator, and I felt like it was going to get bigger and bigger. But yeah, I don't even know if Google necessarily had a data engineering job title at that time. So, you know, that was just very early in the space, and I think we've seen it develop a lot since then. And it's not just in the title. I think, you know, we saw early on that data science was a thing and was going to be a bigger thing. Data engineering was required to allow the data scientists and the quants to do awesome stuff in the first place. And then there are also the analysts, who are trying to consume the data sets, oftentimes in slightly different ways than the data scientists. So I think early on we saw that these three types of roles were super fundamental and foundational to building the future of data infrastructure and business insights and data driven products. And so even though we started off running the data engineering meetup, which I think we're still known for, we pretty quickly, through the conference, embraced these other two disciplines as well, because I think the interplay of how these types of people work together inside organizations is where the really interesting stuff is. And, you know, the themes in these job descriptions, and sort of how they unite and how they work together on projects, is fascinating. And so, through Data Council, our goal has been to further the industry by getting these folks in the same room around content that they all care about. And sometimes it's teaching a data engineer a little bit more about data science, because that's what makes them stronger and better able to operate on a multifunctional team. And sometimes it's the data scientists getting a little bit better at some of the engineering disciplines and understanding more of what the data engineers have to go through. So I think that looking at this world in a cohesive way, especially across these three roles, has really benefited us and made the community and the event very unique and strong. And now I should say that I think the next phase of that, in terms of team organization, and especially in terms of our vision with Data Council, is that we're now embracing product managers into that group as well. I think that, you know, there's the stack: we sort of see this stack of data, being data infrastructure on the bottom, then data engineering and pipelines, then data science and algorithms, then data analytics on top of that. And finally there's the AI features and the people that are weaponizing this entire stack into AI products and data driven features. And I think the final icing on the cake, if you will, is creating space for data oriented product managers. Because, you know, it used to be that maybe you'd think of a data product manager as working for Oracle or being in charge of shipping a database, but that's sort of a bit old school at this point.
And there are all kinds of other interesting applications of data infrastructure and data technologies that are not directly in the database world, where real world product managers in the modern era need to understand how to interact with this stack, and then how to build data tooling, whether it's internal tooling for developers or customer and consumer facing features. So I think embracing the product manager at the top of this stack has been super helpful for our community as well.
Tobias Macey
0:12:07
And I find it interesting that you are bringing in the product manager, because there has long been a division, particularly within technical disciplines, where you have historically the software engineer who is at odds with the systems administrator, and then more recently the data scientist or data analyst who is at odds with the data engineer. But there has been an increasing alignment around the business case, and less so in terms of the technical divisions. And I'm curious what your view is in terms of how the product manager fits into that overall shift in perspective, and what their responsibility is within an organizational structure to help with the overall alignment in terms of requirements and vision and product goals between those different technical disciplines?
Pete Soderling
0:12:59
Yeah, well, hey, I think this question is a microcosm of what's really happening in the engineering world, because I think software engineers, at the core, at the central location, are actually eating disciplines and roles that used to be sort of beneath them and above them. So again, I'm sort of sticking with thinking in terms of this vertical stack. But, you know, most modern tech companies don't have DBAs, because the software engineers are now the DBAs, and many companies don't have designated infrastructure teams, because a software engineer is responsible for their own infrastructure. And some of that is because of cloud or other dynamics, but sort of what's happening is, you know, at its core, the engineer is eating the world. And it's bigger than just software eating the world; engineers are eating the world. And so I think the absorption of some of these older roles into what's now considered core software engineering has happened below, and I think it's happening above. So I think some product management is collapsing into the world of the software engineer, or engineers are sort of laddering up into product management. And I think part of that is the nature of these deeply technical products that we're building. So I think many engineers make awesome product managers. I mean, perhaps they have to step away and, you know, be willing not to write as much code anymore. But because they have an intrinsic understanding of the way these systems are built, I think engineers are just sort of reaching out and absorbing a lot of these other roles. And so some of the best product managers that I think we've seen have been ex software engineers. So I just think that there's a real merging, and this is just a larger perception that I have of the world, into the software engineering related disciplines. And I think it's actually not a far leap to see how a product manager who's informed by an engineering discipline is super effective in that role. So I just think this is a broader story that we're seeing overall, if that makes sense.
Tobias Macey
0:15:02
Yeah, I definitely agree that there has been a much broader definition of what the responsibilities are for any given engineer, because of the fact that there are a lot more abstractions for us to build off of. And so it empowers engineers to be able to actually have a much greater impact with a similar amount of actual effort, because they don't have to worry about everything from the silicon up to the presentation layer, because there are so many useful pre built capabilities that they can take advantage of, and they can think more in terms of the systems rather than the individual bits and bytes. Yeah, exactly. And in terms of your overall efforts for community building and community management, there are a number of different sort of focal points for communities to grow around, which happen because of different shared interests or shared history. So there are programming language communities, there are communities focused on disciplines such as, you know, front end development or product management or business. And I'm wondering what your experience has been in terms of how to orient a community focus along the axis of data, given that it can in some ways be fairly nebulous as to what the common principles in data are, because there are so many different ways to come at it, and so many different formats that data can take?
Pete Soderling
0:16:31
Yeah, I think one of the core values for us, and I don't know if this is necessarily specific to data or not, is just openness. And I think especially, you know, we see ourselves as much, much more than just a conference series, and we use the word community in our team and at our events, just to describe what we're doing, dozens and dozens of times a week. And so, yeah, I think the community bond and the mentality for our brand is super high. I think that, you know, there's also an open source sort of commitment. And I think that's a mentality, I think that's a coding discipline and style. And I think that, you know, sharing knowledge is just super important in any of these technical fields. And so engineers are super thirsty for knowledge, and we see ourselves as being a connecting point where engineers and data scientists can come and share knowledge with each other. I think maybe that's a little bit accelerated in the case of data science or AI research, because these things are changing so fast. And so there is a little bit of an accelerator in terms of the way that we're able to see our community grow, and the interest in this space, because so much of this technical stuff is changing quickly. And, you know, engineers need a trusted source to come to where they can find, and have surfaced for them, the best, most interesting research and most interesting open source tools. So we've capitalized on that, and we try to be that. On one hand we're sort of a media company, on another hand we're an events business, and on another hand we're a community. But we're putting all these things together in a way that we think benefits careers for engineers, and it enables them to level up in their careers and makes them smarter and helps them get better jobs and meet awesome people. So really, all in all, you know, we see ourselves as building this awesome talent community around data and AI worldwide. And we think we're in a super unique position to do that and succeed at it.
Tobias Macey
0:18:33
Community building can be fairly challenging because of the fact that you have so many different disparate experiences and opinions coming together. And sometimes it can work out great, and sometimes you can have issues that come up just due to natural tensions between people interacting in a similar space. And I'm wondering what you have been drawing on for inspiration and reference for how to foster and grow a healthy community, and any interesting or challenging or unexpected aspects of that overall experience of community management that you've encountered in the process?
Pete Soderling
0:19:13
Yeah, I think it's an awesome question, because any company that embraces community to some degree embraces perhaps somewhat of a wild wild west. And I think some companies and brands manage that very heavily top down, and they want to, and they have the resources to, and they're able to. Some others, I think, let the community sprawl. And, you know, in our particular case, because we're a tiny startup, I used to say that we're three people and two PayPal accounts running events all over the world. Even though we're just a touch bigger than that now, not much, we have 18 meetups all over the world and four main conferences from San Francisco to Singapore. So I think the only way that we've been able to do that, and just to be clear, we're a for profit business, but I think that's one of the other things that makes us super unique: yes, we're for profit, but at the same time we're embracing a highly principled notion of community. And we use lots and lots of volunteers in order to help, you know, further that message worldwide, because we can't afford to hire community managers in every single city that we want to operate in. So that's one thing. And I guess, for us, we've just had to embrace kind of the wild nature of what it means to scale a community worldwide, and deal with the ups and downs and challenges that come with that. And of course there's some brand risk, and there are, you know, other sorts of frustrations sometimes when working with volunteers. But I guess my inspiration, you know, specifically in this, was really through 500 Startups, and I went on Geeks on a Plane back in 2012, I believe. And when I saw the way that 500 Startups, which is a startup accelerator in San Francisco, was building community all around the world, basically one plane at a time, and I saw how kind of wild and crazy that community was, I sort of learned the opportunity and the challenge of building community that way. And I think the main thing, you know, if you can embrace the chaos, and if your business situation forces you to embrace the chaos in order to scale, I think the main way that you keep that in line is you just have a few really big core values that you talk about and you emphasize a lot, because basically the community has to sort of manage itself against those values. And, you know, this isn't like a detailed, heavy, top down process, because you just can't do that in that scenario. So I think the most important thing is that the community understands the ethos of what you stand for. And that's why with Data Council, you know, there are a couple of things. I already mentioned open source; that's super important to us. And we're always looking for new ways to lift up open source and to encourage engineers to submit their open source projects for us to promote. We prioritize open source talks at our conference. You know, that's just one thing. I think the other thing for Data Council is that we've committed to not being an over sponsored brand. This can make it hard economically for us to be able to grow and to hire the team that we want to sometimes, but we're very careful about the way we integrate partners and sponsors into our events. And we don't want to be, you know, what we see as some of our competitors being, sort of over saturated and swarming with salespeople.
So those are a few things. I guess the other thing that's super important for us is that we're just deeply, deeply committed to deeply technical content. We screen all of our talks, and we're uncompromising in the speakers that we put on stage. And I think all of these things resonate with engineers. Like, I'm an engineer, and so I know how engineers think, and I think these three things have differentiated us from a lot of the other conferences and events out there. We realized that there was space for this differentiation. And I think all these things resonate with engineers, and now it makes engineers and data scientists around the world want to raise their hands and help us launch meetups. We were in Singapore last month, where we launched our first Data Council conference there, which was amazing. And the whole community came; between Australia and India and the whole region of Southeast Asia, they were there. And we left with three or four new meetups in different cities, because people came to the conference, saw what we stood for, saw that they were sitting next to awesome people and awesome engineers, and it wasn't just a bunch of suits at a data conference. And they wanted to go home and take a slice of Data Council back to their city. And so we support them in creating meetups, and we connect them to our community, and we help them find speakers. And it's just been amazing to really see that thrive. And like I said, the main thing is just knowing the core ethos of what you stand for, and even in the crazy times, just being consistent about the way you communicate that to the community, letting the community run with it, and seeing what happens. And sometimes it's messy, and sometimes it's awesome. But, you know, it's an awesome experiment. And I just think it's incredible that a small company like us can have global reach. And it's only because of the awesome volunteers, community organizers, meetup organizers, and track hosts for our conference that we've been able to pull into this orbit. And we just want to make the world a better place for them. And they've been super, super helpful and kind in supporting us, and we couldn't have done it without them. So it's been an awesome experiment, and, you know, we're continuing to push forward with that model.
Tobias Macey
0:24:33
With so many different participants and the global nature of the community that you're engaging with, there's a lot of opportunity for different surprises to happen. And I'm wondering what have been some of the most interesting or exciting, to paraphrase Bob Ross, happy accidents that you have encountered in your overall engagement with the community?
Pete Soderling
0:24:57
Hmm, I guess this wasn't totally surprising, but I just love to sort of surround myself with geeks. You know, geeks have always been my people. And even when I stopped writing code actively, I just gravitated towards software engineers, which is why I do what I do, and it's what makes me tick. I guess one of the interesting things about running a conference like this is you get to meet such awesome startups. And there are so many incredible startups being started outside of the Valley. You know, I lived in New York City for many years, and I lived in San Francisco for many years, and now I spend most of my time in Wyoming. So I'm relatively off the map in one way of thinking, but in another way, you know, at the center of this conference, we just meet so many awesome engineers and startups all over the globe. And I'm really happy to see such awesome ideas start to spring up from other startup ecosystems. So, you know, I don't believe that all the engineering talent should be focused in Silicon Valley, even though it's easy to go there, learn a ton, really benefit from the community, and benefit from the big companies with scale. But ultimately, I think, you know, not everyone is going to live in the Bay Area. I hope they don't, because it's already getting a little messy there. But I just want to see, you know, all of these things sort of democratized and distributed, both in terms of software engineering and in terms of the engineers that start these awesome startups. And so, you know, the ease with which I'm able to meet and see new startups around the globe through the Data Council community, I think, has been a real bright spot. And I don't know if it was necessarily a surprise, but maybe it's been a surprise to me how quickly it's happened.
Tobias Macey
0:26:34
So one of the other components to managing communities is the idea of sustainability and the overall approach to governance and management. And I'm wondering both for the larger community aspect, as well as for the conferences and meetup events, how you approach sustainability to ensure that there is some longevity and continuity in the overall growth and interaction with the community?
Pete Soderling
0:27:02
Yeah, I think the main thing, and this gets back to another core tenet of sort of the psychographic of a software engineer, is that software engineers need to know how things work. And that's sort of the core of our mentality in building things. We want to know how things work, and if we didn't build it ourselves, we prefer to rip off the covers and understand how it works. And, you know, to be honest, that shapes part of the way that, for instance, we select talks at our conference, and it's something we're learning to get better about. I mean, I think as a value, we believe in openness and transparency in our company, and externally facing, we're getting better about how we actually enable that with the community. For instance, for our next Data Council conference that's coming up in New York in November, we've published all of our track summaries on GitHub, and we've opened that up to the community, where they can contribute ideas, questions, maybe even speakers, themes, sub themes, etc. And I think just the fact that we have the culture to start to plan our events like this in the open brings a lot more transparency. And then I guess the other thing that's just sort of inherent, I think, in a well run community is the amount of diversity you get. And obviously, you know, we're all aware that software engineering as a discipline is suffering from a shortage of diversity in certain areas. And I think as we commit to that locally, regionally, and globally, there are so many types of diversity that we get through the event. So I think both of these things are, you know, super meaningful in keeping the momentum of that community moving forward, because we want to continue to grow. And we want to continue to grow by welcoming folks that maybe didn't necessarily previously identify with the data engineering space, you know, into the community, so that they can see what it's like and evaluate whether they want to take a run at that in their career. So I think all these things, transparency, openness, diversity, these are all hallmarks of a great community, and these are the engines that keep the community going and moving forward, sometimes in spite of the resources, or the lack of resources, that a company like Data Council itself can muster at any one time.
Tobias Macey
0:29:22
In terms of the conference itself, the tagline that you've focused on, and we've talked a little bit about this already, is that they are no fluff, paraphrasing your specific wording, as a way of juxtaposing them against some of the larger events that are a little bit more business oriented, without calling out any specific names. And I'm wondering what your guidelines are for fulfilling that promise, and why you think it is an important distinction. And conversely, what are some of the benefits of those larger organizations, and how do the two scales of event play off each other to help foster the overall community?
Pete Soderling
0:30:02
Yeah, well, one thing here is, I think, it comes down to the mentality of the engineer, and then the other side of it is the mentality of the sponsor and the partner. And, you know, hey, I think engineers are just noble. And like I said, engineers want transparency, they want to know how things work, they don't want to be oversold to, and they want to try products for themselves. There are just all of these sorts of things baked into the engineering mindset. And first and foremost, we want to be known as the conference and the community that respects that. Like, that's the main thing, because without engineers in our community loving it and getting to know each other, if we're not careful about the opportunities and the context that we create for them, they're just going to run in the other direction. And so, first and foremost, those are the hallmarks of what we're building from the engineering side. Then on the partnership side, I think companies are not great at understanding how engineers think. Recruiters are not great at talking to engineers, and marketers are not great at talking to engineers. Yes, engineers need jobs, and yes, engineers need new products and tools, but finding companies that actually know how to respect the mental hurdles that engineers have to get through in order to get interested in your product or get interested in your job, you know, that's a super significant thing. And through my years of working in the space, I've done a lot of coaching and consulting with CTOs, specifically surrounding these two things: recruiting and marketing to engineers. And I think that awesome companies, who respect the central place that engineers have and will continue to have in the innovation economy that's coming, realize that they have to change their tune in the way they approach these engineers. So, you know, our conference platform is a mechanism that we can use to gently steer and even teach some partners how to interact with engineers in a way that doesn't scare them away. And so, broadly speaking, like I mentioned, we're just super careful about how we integrate partners with our event. And we're always, as a team, trying to come up with better ways to message this, and better ways to educate and sort of welcome sponsors, you know, into the special network that we've built. But it's challenging. You know, not all marketers think alike, and not all marketers know how to talk to engineers, but we're committed to creating a new opportunity for them to engage with awesome data scientists and data engineers in a way that's valuable for both of them. And that's a really fun, big challenge. And, you know, we're not as worried about how much it scales right now as we are about the quality, enhancing the quality, of those interactions. And so that's what we're committed to as a brand. And, you know, it's not always easy, but we've learned a lot, and we have a lot left to learn. And we always touch base with the community after the events and ask them what they thought, how they interacted with the partners, did they find a new job, and how did that happen? And so we're always trying to pour gasoline on what works, and to continue to innovate and move forward in that way.
Tobias Macey
0:33:03
In addition to your work on building these events and growing the meetups and overall networking opportunities for people in the engineering space, you have also become involved as an investor, and you've mentioned that you focus on data oriented businesses. And I'm curious how you ended up getting involved in that overall endeavor, how that fits into your work with Data Council, and what your overall personal mission is.
Pete Soderling
0:33:30
Oh, yeah. Well, that's definitely one of my side projects. As I mentioned, I want to help 1000 engineers start companies, and, you know, this is just part of what makes me tick: helping software engineers through the conference, through advising their startups, through investing, through helping them figure out go to market. I guess a lot of this energy for me came, you know, from having started four companies myself, and as an engineer who didn't go to school, but instead opted to start my first company in New York City in 2003. You know, there weren't a lot of people that had started three companies in New York by that point. So yeah, I guess I've learned a lot of things the hard way. And I think a lot of engineers are kind of self taught, and they also tend to learn things the hard way. So a lot of my passion there is, again, sort of meeting engineers where they're at and how they learn. And, you know, to them, I'm kind of a business guy now. I have experience with starting companies, building products, fundraising, marketing, and building sales teams, and most of those things have not necessarily been top of mind for many software engineers that want to start a company. They have a great idea, and they're totally capable of engineering it and building a product, but they need help with all the other, you know, software business stuff, as well as fundraising. So I guess I've discovered the sort of special place I have in the world, where I'm able to help them and coach them through a lot of those business issues. I could never build the infrastructure that they're building, or figure out the algorithms or the models that they're building, but I can help them figure out how to pitch it to investors, how to pitch it to customers, how to go to market, and how to hire a team that scales. And so I discovered that, through my ups and downs as an entrepreneur, I've developed a large set of early stage hustle experience, and I'm just super happy to pass that on to other engineers who are also passionate about starting companies. So that's just something I find myself doing anyway, you know, as a mentor for 500 Startups or as a mentor for other companies. And one thing led to another, and soon I started to do angel investing. And now I have an AngelList syndicate, which is actually quite interesting, because it's backed by a bunch of CTOs and heads of data science and engineers from our community, who all co invest with me. And as I'm able to help companies bring their products to market, oftentimes there will be an investment opportunity there, and then there will be another network of technical people who add even more value to that company. So I'm just sort of, you know, a connector in this community. And the community is doing all kinds of awesome stuff, you know, inside and even sometimes outside of Data Council, which is just a testament to the power of community overall. And, you know, I'm super grateful that I'm along for the ride. I've got engineers who come to me and trust me for help, and I'm able to connect these dots and help them succeed as well.
Tobias Macey
0:36:30
In terms of the ways that businesses are structured, I'm wondering what it is about companies that are founded and led by engineers that makes them stand out, and why you think that it's important that engineers start their own companies, and how that compares to businesses that are founded and run by people who are coming more from the business side, and just your overall experience in terms of the types of focus and levels of success that the two different founding patterns end up having.
Pete Soderling
0:37:03
Yeah, well, I mean, you can tell based on what I've been saying that I'm just super bullish on the software engineer. And, you know, does that mean that the software engineer as a persona or a mentality or a skill set is inherently awesome and has no weaknesses and no problems? Like, hell no, of course not. And I think some of the challenges of being a software engineer, and how that mentality fits into the rest of the business, are well documented. So I think all of us as engineers need to grow and diversify and increase the breadth of our skills, and that has to happen. But on the other hand, if we believe that innovation is going to continue to sweep the world, and highly technical solutions, sometimes to non technical problems and sometimes to technical problems, are going to continue to emerge, I feel like people who have an understanding of the technical implications and the realities and the architectures and the science of those solutions just have an interesting edge. So I think there's a lot of hope in teaching software engineers how to be great CEOs, and I think that's increasingly happening. I mean, look at Jeff Lawson from Twilio, or the guys from Stripe. Even Uber was started by an engineer, right? There was the quiet engineer at Uber, Garrett, sort of quiet in comparison to Travis, you know, who was a co founder of that company. So I think we're seeing engineers not just start the most amazing companies that are changing the world, but also increasingly be in the position of becoming CEOs of those companies. You know, I guess you might even take that one step further. I'm kind of trying to be an engineer who's also been an operator and a founder, but now I'm stepping up to becoming a VC and being an investor. So I think there's the engineer VC, which is really interesting as well, but that's a slightly different conversation. But suffice it to say that engineers are bringing a valid mentality into all of these disciplines. And yes, of course, an engineer has to be taught to think like a CEO, and has to learn how to be strategic, and has to learn sales and marketing skills. But I think it's just an awesome challenge to be able to layer those skills on top of the engineering discipline that they already have. And I'm not saying this is the only way to start a company, or that business people can't find awesome engineers to partner with them. I mean, honestly, I think an engineer often needs a business co founder to help get things going. But I'm coming at it from the engineering side, and then figuring out all the other support that the engineer needs to make a successful company. And that's just because I've chosen that particular way, but other people will be coming at it from the business side, and I'm sure that will be fine for them as well.
Tobias Macey
0:39:47
In terms of the challenges that are often faced by an engineering founder in growing a business and making it viable, what are some of the common points of conflict or misunderstandings or challenges that they encounter? And what are some of the ways that you typically work to help them in establishing the business and setting out on a path to success?
Pete Soderling
0:40:10
Well, I think the biggest thing, you know, that I see is that many engineers are just afraid to sell. And unfortunately, you can't have a business if you don't have some sales. And so somehow engineers have to get over that hurdle, and that can be a long process. It's been a long process for me, and I still undersell what we do at Data Council, to be honest, in some ways, and I have people around me to help me with that. And we want to do that, again, in a way that's in line with the community. But I'm constantly trying to figure out how to be essentially a better salesperson. For me, that means still sticking to the core of my engineering values, which is honesty, transparency, enthusiasm, value, and really understanding how to articulate the value of what you're bringing in a way that's unabashedly honest and transparent. So I think that's a really big thing for a pure engineer founder: it can be difficult to go out there and figure out how to get your first customers and how to start to sell yourself personally. And then the next step is, how do you build a sales culture and a process and a team that's in line with your values as a company? And that scares some engineers, because it's just terrifying to think about building a sales org when you can barely accept that your product needs to be sold by you yourself. But I think that's just, you know, sort of ground zero for starting a company. And so I try to be as gentle as possible in guiding engineers through that process. But I guess that's one of the core hiccups that engineers have to figure out how to get over, by bringing in other people they trust, or getting advice. You know, you have to approach it sometimes with kid gloves, but teaching engineers how to sell in a way that's true to their values, I think, is just a really big elephant in the room that I constantly run into and try to help engineering founders with.
Tobias Macey
0:42:08
In terms of the businesses that you evaluate for investing in, what is your strategy for finding the opportunities, and what are the characteristics of a business that you look for when deciding whether or not to invest? And then, as a corollary to all of that, why is it that you're focusing primarily on businesses that operate in the data space, and what are the types of services that you're looking for that you think will be successful and useful?
Pete Soderling
0:42:37
Yeah. Well, I guess, last question first. I think, you know, it's important to have focus as an investor; not everybody can do everything awesomely. I think it's also a strategy for building a new brand and a niche fund in the face of the Sequoias and the Kleiners of the world. I think we might be past the days when a new fund can easily build a brand that's that expansive. So I think this is just kind of, you know, typical marketing strategy. If you start to focus on a niche and do one niche really well, that produces network effects, smaller network effects inside that niche, that then grow and grow and develop. So I've chosen to focus in my professional life on data: data science, data engineering, data analytics. That's Data Council. Partly that's because I just believe in the upside of that market. So I think that's just a natural analogue to a lot of my investing activity, because I'm knowledgeable about the space, because I have a huge network in the space, and because I'm looking at and interested in talking to these companies all the time. It's just a natural fit. That's not to say that I don't go beyond it; I mean, I'm also passionate about broader developer tools, and as you mentioned earlier, I'm an investor in Clubhouse, and I'm interested in security companies. So I think, you know, for me, there are some analogues to just deeply technical companies built by technical founders that also fit my thesis. But, you know, still, it's a fairly narrow, niche thesis. Most of the stuff I do on the investing side is B2B. I meet the companies through my network and through Data Council, and I think they're solving meaningful problems in the B2B space. Another criterion often is that they may be supported by some previous open source success, or the company might be built on some current open source project that I feel gives them an unfair advantage when it comes to distribution and getting a community excited about the product. So these are a few of the things that I look for in terms of the investing thesis.
Tobias Macey
0:44:39
In terms of the overall industry and some of the trends that you're observing in your interactions with engineers, and in vetting the different conference presentations and meetup topics, what are some of the trends that you're most excited by? And what are some of the ones that you are either concerned by, or some potential issues that you see coming down the road, in terms of the overall direction that we are heading and challenges that might be lurking around the corner?
Pete Soderling
0:45:10
Well, I think, you know, one big thing there is just data science and ethics: AI fairness, bias in AI and in deep learning models, ethics when it comes to targeting customers and what data you keep on people. I think all these things are just super interesting as business issues. They're policy issues and business issues at one level, and they're also technical issues, so there's technical implementation work that's required. But I just think raising that discussion is important. And so that's one area that we're focusing on at Data Council in the next series of events that we run later this year and next year: elevating that content in front of our community so that it can be a matter of discussion, because I think those are important topics, even if they aren't always seen as the most technical. But I think they're super important in terms of us trying to help the community steer and figure out where the ship is going in the future. So I think that's super interesting. And then, I guess, on the technical side, I think the data ops world is starting to mature. I think there are a lot of interesting opportunities there, in the same way that the DevOps revolution, you know, revolutionized the way that software was built, tested, deployed, and monitored, and companies like Chef and New Relic came out in perhaps the mid 2000s. I think we're at a similar inflection point with data ops, if you will. There's more repeatable pipeline process, there are back testing and auditing capabilities that are required for data pipelines that aren't always there, and there's monitoring infrastructure that's being built, and some cool companies that I've seen recently. So I think data ops, and basically just elevating data engineering to the level that software engineering has been at for a while, is definitely something that seems to be catching fire, and we also try to support that at the conference as well.
Tobias Macey
0:47:06
Looking forward, what are some of the goals and plans that you have for the future of the Data Council business and the overall community and events?
Pete Soderling
0:47:17
Well, I think, as I mentioned, our biggest goal is to build the data and AI talent community worldwide, and I think that we're building a network business. So I guess it kind of takes me back to when I started Hakka Labs, which was the digital social network, and I thought we were building a digital product. And as I already mentioned, one thing led to another, and now we have Data Council instead. Well, Data Council is, you know, butts in seats at events, and community, and it's IRL engineers getting together. But it's still a network. It's not a super fast, digital, formal network, but it's a network, and it's a super meaningful network. So it's kind of interesting that after all the ups and downs, I still see us as being in the network building business. And I think the cool thing about building a network is that once you build it, there's lots of different value that you can add on top of it. So there are really big ideas here: there's formalizing education and education offerings, there are consulting services that can be built on top of this network where companies can help each other, and recruiting sort of fits in the same vein. I think there are other things as well. There's a fund, which I have mentioned is a side project that I have, to help engineers in this community start companies. So I think there are all kinds of interesting things that you can build on top of a network once you've gotten there. And for now, you know, our events are essentially a breakeven business that just gives us an excuse, and the ability, to grow this network all around the world. But I think there's a much bigger phase two or phase three of this company, where we can build really awesome stuff based on this engineering network, a network and a brand that engineers trust, that we've laid down in the early part of the building phase. So I'm really excited to see that develop, and to carry that strategy and mission forward.
Tobias Macey
0:49:14
Are there any other aspects of your work on Data Council, or your investments, or your overall efforts in the data space that we didn't discuss yet that you'd like to cover before we close out the show?
Pete Soderling
0:49:26
No, I think we covered a lot of stuff. I hope it was interesting for you and your audience. And I encourage folks to reach out to me and get in touch. If there are engineers out there that want to start companies, or engineers that want to participate in our community worldwide, we're always looking for awesome people to help us find and screen talks. We're interested in awesome speakers as well. I'm always interested in talking to deep learning and AI researchers who are out there who might have ideas that they want to bring to market. But yeah, you can reach me at pete at datacouncil.ai, and I'm happy to plug you into our community. And yeah, if I can be helpful to anyone out there, I would just really encourage them to reach out.
Tobias Macey
0:50:09
And for anyone who does want to follow up with you, or keep in touch, or follow along with the work that you're doing, I'll have your contact information in the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Pete Soderling
0:50:26
Yeah, as I mentioned, I think maybe it comes down to this data ops thing, right? There are really interesting open source projects coming out, like Great Expectations, and interesting companies coming out like Elementl, which is built around Dagster, which is an open source project. So I think that this is a really interesting niche tooling area, specifically in the data engineering world, that I think we should all be watching. And then I guess the other category of tooling I'm seeing is sort of related; it's in the monitoring space. It's watching the data in your data warehouse to see if there are anomalies that pop up, because we're pulling together data from so many different sources, hundreds of sources, now that I think it's a little bit tricky to watch for data quality and integrity. And so I think there's a new suite of tools popping up in that data monitoring space which are very interesting. So those are a couple of areas that I'm interested in and looking at, especially when it comes to data engineering applications.
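As a concrete illustration of the kind of data quality check described above, here is a minimal sketch using the classic pandas-based Great Expectations API; the library's API has evolved in newer releases, and the file name and column names here are hypothetical.

    import great_expectations as ge

    # Wrap a batch of warehouse data (here, a hypothetical CSV extract) in an
    # expectation-aware DataFrame.
    df = ge.read_csv("orders_extract.csv")

    # Declare expectations about the integrity of the data.
    df.expect_column_values_to_not_be_null("order_id")
    df.expect_column_values_to_be_unique("order_id")
    df.expect_column_values_to_be_between("order_total", min_value=0, max_value=100000)

    # Validate the batch; a pipeline could alert or halt when the overall
    # success flag in the result is False.
    results = df.validate()
    print(results)

The monitoring tools in the space he describes apply a similar idea continuously, watching warehouse tables for anomalies rather than validating a single batch.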
Tobias Macey
0:51:26
Well, thank you very much for taking the time today to share your experiences in building and growing this event series and contributing to the overall data community, as well as your efforts on the investment and business side. It's definitely an area that I find valuable, and I've been keeping an eye on your conferences; there have been a lot of great talks that have come out of them. So I appreciate all of your work on that front, and I hope you enjoy the rest of your day.
Pete Soderling
0:51:50
Yeah, thanks, Tobias. We'll see you at a Data Council conference sometime soon.
Tobias Macey
0:51:59
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Digging Into Data Replication At Fivetran - Episode 93

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing George Fraser about FiveTran, a hosted platform for replicating your data from source to destination

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that Fivetran solves and the story of how it got started?
  • Integration of multiple data sources (e.g. entity resolution)
  • How is Fivetran architected and how has the overall system design changed since you first began working on it?
  • monitoring and alerting
  • Automated schema normalization. How does it work for customized data sources?
  • Managing schema drift while avoiding data loss
  • Change data capture
  • What have you found to be the most complex or challenging data sources to work with reliably?
  • Workflow for users getting started with Fivetran
  • When is Fivetran the wrong choice for collecting and analyzing your data?
  • What have you found to be the most challenging aspects of working in the space of data integrations?
  • What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
  • What do you have planned for the future of Fivetran?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI conference, the Strata Data conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing George Fraser about Fivetran, a platform for shipping your data to data warehouses in a managed fashion. So George, can you start by introducing yourself?
George Fraser
0:01:54
Yeah, my name is George. I am the CEO of Fivetran, and I was one of the two co-founders of Fivetran almost seven years ago when we started.
Tobias Macey
0:02:02
And do you remember how you first got involved in the area of data management?
George Fraser
0:02:05
Well, before Fivetran, I was actually a scientist, which is a bit of an unusual background for someone in data management, although it was sort of an advantage for us that we were coming at it fresh. So much has changed in the area of data management, particularly because of the new data warehouses that are so much faster, so much cheaper, and so much easier to manage than the previous generation, that a fresh approach is really merited. And so in a weird way, the fact that none of the founding team had a background in data management was kind of an advantage.
Tobias Macey
0:02:38
And so can you start by describing a bit about the problem that Fivetran was built to solve, the overall story of how it got started, and what motivated you to build a company around it?
George Fraser
0:02:50
Well, I'll start with the story of how it got started. So in late 2012, when we started the company, Taylor and I, and then Mel, who's now our VP of engineering and who joined early in 2013, Fivetran was originally a vertically integrated data analysis tool. It had a user interface that was sort of a super-powered spreadsheet slash BI tool, it had a data warehouse on the inside, and it had a data pipeline that was feeding the data warehouse. And through many iterations of that idea, we discovered that the really valuable thing we had invented was actually the data pipeline that was part of that. And so we threw everything else away, and the data pipeline became the product. And the problem that Fivetran solves is the problem of getting all your company's data in one place. So companies today use all kinds of tools to manage their business. You use CRM systems like Salesforce, payment systems like Stripe, support systems like Zendesk, finance systems like QuickBooks or Zuora, and you have a production database somewhere, maybe you have 20 production databases. And if you want to know what is happening in your business, the first step is usually to synchronize all of this data into a single database, where an analyst can query it, and where you can build dashboards and BI tools on top of it. So that's the primary problem that Fivetran solves. People use Fivetran to do other things; sometimes they use the data warehouse that we're syncing to as a production system. But the most common use case is that they're just trying to understand what's going on in their business, and the first step in that is to sync all of that data into a single database.
Tobias Macey
0:04:38
And in recent years, one of the prevalent approaches for being able to get all the data into one location for being able to do analysis across it is to dump it all into a data lake because of the fact that you don't need to do as much upfront schema management or data cleaning. And then you can experiment with everything that's available. And I'm wondering what your experience has been as far as the contrast between loading everything into a data warehouse for that purpose versus just using a data lake.
George Fraser
0:05:07
Yeah. So in this area, I think that sometimes people present a bit of a false choice: either you can set up a data warehouse, do full-on Kimball dimensional schema data modeling and Informatica, with all of the upsides and downsides that come with that, or you can build a data lake, which is like a bunch of JSON and CSV files in S3. And I say false choice because I think the right approach is a happy medium, where you don't go all the way to sticking raw JSON files and CSV files in S3; that's really unnecessary. Instead, you use a proper relational data store, but you exercise restraint in how much normalization and customization you do on the way in. So you say: I'm going to make my first goal to create an accurate replica of all the systems in one database, and then I'm going to leave that alone. That's going to be my sort of staging area, kind of like my data lake, except it lives in a regular relational data warehouse. And then I'm going to build whatever transformations I want to do of that data on top of that data lake schema. So another way of thinking about it is that I am advising that you should take a data lake type approach, but you shouldn't make your data lake a separate physical system. Instead, your data lake should just be a different logical system within the same database that you're using to analyze all your data and to support your BI tool. It's just a higher productivity, simpler workflow to do it that way.
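As a loose sketch of the staging-plus-curation pattern George describes (not Fivetran's actual implementation), the snippet below keeps the raw replica untouched in one schema and builds a curated view on top of it with plain SQL. All schema, table, and column names are invented for illustration.

```python
# The raw replica lands in a "staging" schema that the pipeline owns, and
# analysts build curated models on top of it with SQL. Names are hypothetical.
RAW_SCHEMA = "salesforce_raw"    # untouched replica, managed by the pipeline
ANALYTICS_SCHEMA = "analytics"   # curated layer, owned by analysts

CURATED_ACCOUNTS = f"""
CREATE OR REPLACE VIEW {ANALYTICS_SCHEMA}.accounts AS
SELECT
    id           AS account_id,
    name         AS account_name,
    created_date AS created_at
FROM {RAW_SCHEMA}.account
WHERE is_deleted = FALSE;
"""
# Because the raw schema is never modified, a mistaken transformation can be
# rebuilt at any time by re-running a statement like the one above, e.g.
# conn.execute(CURATED_ACCOUNTS) against the warehouse.
```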
Tobias Macey
0:06:47
Yeah. And that's where the current trend of moving the transformation step until after the data loading, the ELT pattern, has been coming from, because of the flexibility of these cloud data warehouses that you've mentioned, as far as being able to consume semi-structured and unstructured data while still being able to query across it and introspect it for the purposes of being able to join with other information that's already within that system.
George Fraser
0:07:11
Yeah, the ELT pattern is really just a great way to get work done. It's simple, and it allows you to recover from mistakes. So if you make a mistake in your transformations, and you will make mistakes in your transformations, or even if you just change your mind about how you want to transform the data, the great advantage of the ELT pattern is that the original untransformed data is still sitting there side by side in the same database. So it's just really easy to iterate in a way that it isn't if you're transforming the data on the fly, or even if you have a data lake where you store the API responses from all of your systems; that's still more complicated than if you just have this nice replica sitting in its own schema in your data warehouse.
Tobias Macey
0:07:58
And so one of the things that you pointed out is needing to be able to integrate across multiple different data sources that you might be using within a business. And you mentioned things like Salesforce for CRM, or things like ticket tracking and user feedback such as Zendesk, etc. And I'm wondering what your experience has been as far as being able to map the logical entities across these different systems together, to be able to effectively join and query across those data sets, given that they don't necessarily have a shared sense of truth for things like how customers are represented, or even what the common field names might be, to be able to map across those different entities.
George Fraser
0:08:42
Yeah, this is a really important step, and it's the first thing we always advise our customers to do; even for anyone who's building a data warehouse, I would advise this. You need to keep straight in your mind that there are really two problems here. The first problem is replicating all of the data, and the second problem is rationalizing all the data into a single schema. And you need to think of these as two steps; you need to follow proper separation of concerns, just as you would in a software engineering project. So we really focus on that first step, on replication. What we have found is that the approach that works really well for our customers for rationalizing all the data into a single schema is to use SQL. SQL is a great tool for unioning things, joining things, changing field names, filtering data, all the kinds of stuff you need to do to rationalize a bunch of different data sources into a single schema. We find the most productive way to do that is to use a bunch of SQL queries that run inside your data warehouse.
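Here is a hedged example of what that in-warehouse rationalization step might look like, unioning two hypothetical source replicas into one customers view. The schemas and columns are invented; the point is only that renaming, filtering, and unioning happen in SQL over the raw replicas.

```python
# One SQL statement that rationalizes two raw source schemas into a single
# analytics model. Schema and column names are hypothetical.
UNIFIED_CUSTOMERS = """
CREATE OR REPLACE VIEW analytics.customers AS
SELECT id AS customer_id, name, email, 'salesforce' AS source
FROM salesforce_raw.account
WHERE is_deleted = FALSE
UNION ALL
SELECT id AS customer_id, company_name AS name, contact_email AS email, 'stripe' AS source
FROM stripe_raw.customers
WHERE deleted = FALSE;
"""
```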
Tobias Macey
0:09:44
And do you have your own tooling and interfaces for being able to expose that process to your end users? Or do you also integrate with tools such as dbt for being able to have that overall process controlled by the end user?
George Fraser
0:10:00
We originally did not do anything in this area other than give advice, and we got the advantage that we got to watch what our users did in that context. And what we saw is that a lot of them set up cron to run SQL scripts on a regular schedule. A lot of them used Looker persistent derived tables. Some people used Airflow, and they used Airflow in kind of a funny way: they didn't really use the Python parts of Airflow, they just used Airflow as a way to trigger SQL. And when dbt came out, we got a decent community of users who use dbt. And we're supportive of whatever mechanism you want to use to transform your data. We do now have our own transformation tool built into our UI, and it's the first version that you can use right now. It's basically a way that you can provide a SQL script, and you can trigger that SQL script when Fivetran delivers new data to your tables. And we've got lots of people using the first version of that. That's going to continue to evolve over the rest of this year; it's going to get a lot more sophistication, and it's going to do a lot more to give you insight into the transforms that are running and how they all relate to each other. But the core idea of it is that SQL is the right tool for transforming data.
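As a sketch of the cron-plus-SQL-scripts workflow mentioned above, assuming a Postgres-compatible warehouse reachable via psycopg2 and placeholder connection details, a scheduled transformation job might look roughly like this:

```python
# Apply each .sql file in a directory against the warehouse; intended to be
# invoked from a crontab entry. Connection string and paths are placeholders.
import pathlib
import psycopg2  # assuming a Postgres/Redshift-compatible warehouse

def run_transformations(sql_dir="transformations"):
    conn = psycopg2.connect("dbname=warehouse user=analyst")  # placeholder DSN
    try:
        with conn, conn.cursor() as cur:
            for path in sorted(pathlib.Path(sql_dir).glob("*.sql")):
                cur.execute(path.read_text())  # each file is one idempotent statement
    finally:
        conn.close()

if __name__ == "__main__":
    # e.g. crontab: 0 * * * * python run_transformations.py
    run_transformations()
```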
Tobias Macey
0:11:19
And before we get too far into the rest of the feature set and capabilities of Fivetran, I'm wondering if you can talk about how the overall system is architected, and how the overall system design has evolved since you first began working on it.
George Fraser
0:11:33
Yeah, so the overall architecture is fairly simple. The hard part of Fivetran is really not the sort of high-class data problems, things like queues and streams and giant data sets flying around. The hard part of Fivetran is really all of the incidental complexity of all of these data sources, understanding all the small, sort of crazy rules that every API has. So most of our effort over the years has actually been devoted to hunting down all these little details of every single data source we support, and that's what makes our product really valuable. The architecture itself is fairly simple. The original architecture was essentially a bunch of EC2 instances with cron, running a bunch of Java processes on a fast batch cycle, syncing people's data. Over the last year and a half, the engineering team has built a new architecture based on Kubernetes. There are many advantages of this new architecture for us internally; the biggest one is that it auto-scales. But from the outside, you can't even tell when you migrate from the old architecture to the new architecture, other than you have to whitelist a new set of IPs. So, you know, it was a very simple architecture in the beginning, and it's gotten somewhat more complex. But really, the hard part of Fivetran is not the high-class data engineering problems, it's the little details of every data source, so that from the user's perspective, you just get this magical replica of all of your systems in a single database.
Tobias Macey
0:13:16
And for being able to keep track of the overall health of your system and ensure that data is flowing from end to end for all of your different customers, I'm curious what you're using for your monitoring and alerting strategy, and any sort of ongoing continuous testing, as well as advanced unit testing, that you're using to make sure that all of your API interactions are consistent with what is necessary for the source systems that you're working with.
George Fraser
0:13:42
Yeah, well, first of all, there are several layers to that. The first one is actually the testing that we do on our end to validate that all of our sync strategies, all those little details I mentioned a minute ago, are actually working correctly. Our testing problem is quite difficult, because we interoperate with so many external systems, and in many cases you really have to run the tests against the real system for the test to be meaningful. And so our build architecture is actually one of the more complex parts of Fivetran. We use a build tool called Bazel, and we've done a lot of work, for example, to run all of the databases and FTP servers and things like that that we have to interact with in Docker containers, so that we can actually produce reproducible end-to-end tests. So that actually is one of the more complex engineering problems at Fivetran, and if that sounds interesting to you, I encourage you to apply to our engineering team, because we have lots more work to do on that. So that's the first layer: all of those tests that we run to verify that our sync strategies are correct. The second layer is, you know, is it working in production? Is the customer's data actually getting synced, and is it getting synced correctly? And one of the things we do there that may be a little unexpected to people who are accustomed to building data pipelines themselves is that all Fivetran data pipelines are typically fail-fast. That means if anything unexpected happens, if we see, you know, some event from an API endpoint that we don't recognize, we stop. Now, that's different than when you build data pipelines yourself; when you build data pipelines for your own company, usually you will have them try to keep going no matter what. But Fivetran is a fully managed service, and we're monitoring it all the time. So we tend to make the opposite choice: if anything suspicious is going on, the correct thing to do is just stop and alert Fivetran, hey, go check out this customer's data pipeline, what the heck is going on? Something unexpected is happening, and we should make sure that our sync strategies are actually correct. And then that brings us to the last layer of this, which is alerting. So when data pipelines fail, we get alerted and the customer gets alerted at the same time. And then we communicate with the customer, and we say, hey, we may need to go in and check something; do I have permission to go, you know, look at what's going on in your data pipeline in order to figure out what's going wrong? Because Fivetran is a fully managed service, and that is critical to making it work. When you do what we do, and you say you are going to take responsibility for actually creating an accurate replica of all of your systems in your data warehouse, that means you're signing on to comprehend and fix every little detail of every data source that you support. And a lot of those little details only come up in production, when some customer shows up and they're using a feature of Salesforce that Salesforce hasn't sold for five years, but they've still got it, and you've never seen it before. A lot of those little things only come up in production. The nice thing is that that set of little things, while it is very large, is finite, and we only have to discover each problem once, and then every customer thereafter benefits from that.
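A toy version of that fail-fast behavior might look like the following; the event structure, field names, and exception are hypothetical stand-ins, not Fivetran's actual code.

```python
# Instead of silently skipping unrecognized data, the sync stops and raises
# so that an operator is alerted. Event shape and names are invented.
KNOWN_EVENT_TYPES = {"created", "updated", "deleted"}

class UnexpectedSourceData(Exception):
    """Raised to halt the sync so the pipeline operators get paged."""

def apply_event(event, destination):
    event_type = event.get("type")
    if event_type not in KNOWN_EVENT_TYPES:
        # Fail fast: a managed service would rather stop and investigate
        # than risk writing an incorrect replica.
        raise UnexpectedSourceData(f"unrecognized event type: {event_type!r}")
    destination.write(event)
```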
Tobias Macey
0:17:00
Thanks. For the system itself, one of the things that I noticed while I was reading through the documentation and the feature set is that for all of these different source systems, you provide automated schema normalization. And I'm curious how that works, and the overall logic flow that you have built in. Is it just a static mapping that you have for each different data source, or is there some more complex algorithm going on behind the scenes? And how does that work for any sort of customized data sources, such as application databases that you're working with, or maybe just JSON feeds or event streams?
George Fraser
0:17:38
Sure. So the first thing you have to understand is that there are really two categories of data sources in terms of schema normalization. The first category is databases, like Oracle or MySQL or Postgres, and database-like systems; NetSuite is really basically a database when you look at the API, and so is Salesforce. There are a bunch of systems that basically look like databases: they have arbitrary tables and columns, and you can set any types you want in any column. What we do with those systems is we just create an exact one-to-one replica of the source schema. It's really as simple as that. There's a lot of work to do, because the change feeds from those systems are usually very complicated, and it's very complex to turn those change feeds back into the original schema, but it is automated. So for databases and database-like systems, we just produce the exact same schema in your data warehouse as it was in the source. For apps, things like Stripe or Zendesk or GitHub or Jira, we do a lot of normalization of the data. With tools like that, when you look at the API responses, they are very complex and nested, and usually very far from the original normalized schema that this data probably lived in, in the source database. And every time we add a new data source of that type, we study the data source; I joke that we reverse engineer the API. We basically figure out what the schema in the database originally was, and we unwind all the API responses back into that normalized schema. These days, we often just get an engineer at the company that is that data source on the phone and ask them, you know, what is the real schema here? We found that we can save ourselves a whole lot of work by doing that. But the goal is always to produce a normalized schema in the data warehouse. And the reason why we do that is because we just think, if we put in that work up front to normalize the data in your data warehouse, we can save every single one of our customers a whole bunch of time traipsing through the data trying to figure out how to normalize it. So we figure it's worthwhile for us to put the effort in up front so our customers don't have to.
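To make the unnesting idea concrete, here is a small, hypothetical example of unwinding a nested API response back into normalized rows; the payload shape, field names, and table names are invented for illustration.

```python
# One nested API document becomes rows in two relational tables.
def normalize_invoice(payload):
    """Split a nested invoice API response into row dicts per table."""
    invoice_row = {
        "id": payload["id"],
        "customer_id": payload["customer"]["id"],
        "currency": payload["currency"],
    }
    line_item_rows = [
        {
            "invoice_id": payload["id"],
            "line_number": i,
            "sku": item["sku"],
            "amount": item["amount"],
        }
        for i, item in enumerate(payload.get("line_items", []))
    ]
    return {"invoice": [invoice_row], "invoice_line_item": line_item_rows}
```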
Tobias Macey
0:20:00
One of the other issues that comes up with normalization, and particularly for the source database systems that you're talking about, is the idea of schema drift, when new fields are added or removed, or data types change, or the default data types change. And I'm wondering how you manage schema drift overall in the data warehouse systems that you're loading into, while preventing data loss, particularly in the cases where a column might be dropped or the data type changed.
George Fraser
0:20:29
Yeah, so there's a core pipeline that all Fivetran connectors, databases, apps, everything, are written against, that we use internally, and all of the rules of how to deal with schema drift are encoded there. So some cases are easy. Like, if you drop a column, then that data just isn't arriving anymore; we will leave that column in your data warehouse, we're not going to delete it in case there's something important in it. You can drop it in your data warehouse if you want to, but we're not going to. If you add a column, again, that's pretty easy: we add a column in your data warehouse, all of the old rows will have nulls in that column, obviously, but going forward we will populate that column. The tricky cases are when you change the types. When you alter the type of an existing column, that can be more difficult to deal with. There are two principles we follow. First of all, we're going to propagate that type change to your data warehouse, so we're going to go and change the type of the column in your data warehouse to fit the new data. And the second principle we follow is that when you change types, sometimes you sort of contradict yourself, and we follow the rules of subtyping in handling that. If you think back to your undergraduate computer science classes, this is the good old concept of subtypes: for example, an int is a subtype of a real, and a real is a subtype of a string, etc. So we look at all the data passing through the system, and we infer the most specific type that can contain all of the values that we have seen, and then we alter the data warehouse column to be that type, so that we can actually fit the data into the data warehouse.
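A simplified sketch of that subtyping rule, assuming just three types (int, real, string), might look like this; a real pipeline handles many more types and edge cases, so this is only the idea.

```python
# Infer the most specific type (int <= real <= string) that can hold every
# value seen so far, then the column would be widened to match.
TYPE_ORDER = ["int", "real", "string"]  # each type is a subtype of the next

def value_type(value):
    for candidate, parser in (("int", int), ("real", float)):
        try:
            parser(value)
            return candidate
        except (TypeError, ValueError):
            continue
    return "string"

def infer_column_type(values):
    widest = "int"
    for v in values:
        t = value_type(v)
        if TYPE_ORDER.index(t) > TYPE_ORDER.index(widest):
            widest = t
    return widest

# infer_column_type(["1", "2", "3"])      -> "int"
# infer_column_type(["1", "2.5"])         -> "real"
# infer_column_type(["1", "2.5", "n/a"])  -> "string"
```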
Tobias Macey
0:22:17
Another capability that you provide is Change Data Capture for when you're loading from these relational database systems into the data warehouse. And that's a problem space that I've always been interested in as far as how you're able to capture the change logs within the data system, and then be able to replay them effectively to reconstruct the current state of the database without just doing a straight SQL dump. And I'm wondering how you handle that in your platform?
George Fraser
0:22:46
Yeah, it's very complicated. Most people who build in-house data pipelines, as you say, just do a dump and load of the entire table, because the change logs are so complicated. And the problem with dump and load is that it requires huge bandwidth, which isn't always available, and it takes a long time. So you end up running it just once an hour if you're lucky, but for a lot of people, once a day. So we do change data capture: we read the change logs of each database. Each database has a different change log format, and most of them are extremely complicated. If you look at the MySQL change log format, or the Oracle change log format, it is like going back in time through the history of MySQL; you can sort of see every architectural change in MySQL in the change log format. The answer to how we do that is that there's no trick. It's just a lot of work understanding all the possible corner cases of these change logs. It helps that we have many customers with each database. So unlike when you're building a system just for yourself, because we're building a product, we have lots of MySQL users and lots of Postgres users, and over time we see all the little corner cases, and you eventually figure it out. You eventually find all the things and you get a system that just works. But the short answer is there's really no trick; it's just a huge amount of effort by the databases team at Fivetran, who at this point has been working on it for years with, you know, hundreds of customers. So at this point, we've invested so much effort in tracking down all those little things that there's just no hope that you could do better yourself, building a change log reader just for your own company.
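As a drastically simplified illustration of replaying a change log into a replica (the real formats, as George says, are far more involved; the event format here is invented):

```python
# Replay a stream of change events into an in-memory replica keyed by
# primary key. A real CDC reader parses binlogs/redo logs instead.
def apply_change(replica, change):
    """replica: dict keyed by primary key; change: one log event."""
    op, key, row = change["op"], change["key"], change.get("row")
    if op in ("insert", "update"):
        replica[key] = row            # upsert the latest image of the row
    elif op == "delete":
        replica.pop(key, None)        # remove the row from the replica
    else:
        raise ValueError(f"unknown operation {op!r}")

def replay(replica, change_log):
    for change in change_log:
        apply_change(replica, change)
    return replica
```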
Tobias Macey
0:24:28
For the particular problem space that you're in, you have a sort of many-to-many issue, where you're dealing with a lot of different types of data sources, and then you're loading it into a number of different options for data warehouses. On the source side, I'm wondering what you have found to be some of the most complex or challenging sources to be able to work with reliably, and some of the strategies that you have found to be effective for picking up a new source and being able to get it production ready in the shortest amount of time.
George Fraser
0:24:57
Yeah, it's funny, you know, if you ask any engineer at Fivetran, they can all tell you what the most difficult data sources are, because we've had to do so much work on them over the years. Undoubtedly, the most difficult data source is Marketo; close seconds are Jira, Asana, and then probably NetSuite. Those APIs just have a ton of incidental complexity, and it's really hard to get data out of them fast. We're working with some of these sources to try to help them improve their APIs to make it easier to do replication, but there's a handful of data sources that have required disproportionate work to get them working reliably. In general, one funny observation that we have seen over the years is that the companies with the best APIs tend to, unfortunately, be the least successful companies. It seems to be a general principle that companies which have really beautiful, well-organized APIs tend to not be very successful businesses, I guess because they're just not focused enough on sales or something. We've seen it time and again, where we integrate a new data source, and we look at the API and we go, man, this API is great, I wish you had more customers so that we could sync for them. The one exception, I would say, is Stripe, which has a great API and is a highly successful company, and that's probably because their API is their product. So there's definitely a spectrum of difficulty. In general, the oldest, largest companies have the most complex APIs.
Tobias Macey
0:26:32
I wonder if there's some reverse incentive where they make their APIs obtuse and difficult to work with, so that they can build up an ecosystem of contractors around them whose sole purpose is to be able to integrate them with other systems.
George Fraser
0:26:46
You know, I think there's a little bit of that, but less than you would think. For example, the company that has by far the most extensive ecosystem of contractors helping people integrate their tool with other systems is Salesforce, and Salesforce's API is quite good. Salesforce is actually one of the simpler APIs out there. It was harder a few years ago when we first implemented it, but they made a lot of improvements, and it's actually one of the better APIs now.
Tobias Macey
0:27:15
Yeah, I think that's probably coming off the tail of their acquisition of MuleSoft to sort of reformat their internal systems and data representation to make it easier to integrate. Because I know beforehand, it was just a whole mess of XML.
George Fraser
0:27:27
You know, it was really before the MuleSoft acquisition that a lot of the improvements in the Salesforce API happened. The Salesforce REST API was pretty well structured and rational five years ago, but it would fail a lot; you would send queries and they would just not return when you had really big data sets, and now it's more performant. So I think it predates the MuleSoft acquisition; they just did the hard work to make all the corner cases work reliably and scale to large data sets, and Salesforce is now one of the easier data sources. Actually, I think there are certain objects that have complicated rules, and the developers at Fivetran who work on Salesforce will get mad at me when they hear me say this, but compared to, like, NetSuite, it's pretty great.
Tobias Macey
0:28:12
On the other side of the equation, where you're loading data into the different target data warehouses, I'm wondering what your strategy is as far as being able to make the most effective use of the feature sets that are present. Or do you just target the lowest common denominator of SQL representation for being able to load data in, and then leave the complicated aspects of it to the end user for doing the transformations and analyses?
George Fraser
0:28:36
So most of the code for doing the load side is shared between the data warehouses; the differences between different destinations are not that great, except for BigQuery. BigQuery is a little bit of an unusual creature, so if you look at Fivetran's code base, there's actually a different implementation for BigQuery that shares very little with all of the other destinations. So the differences between destinations are not that big of a problem for us. There are certain things that do differ; there are functions that have to be overridden for different destinations, for things like the names of types, and there are some special cases around performance where our load strategies are slightly different, for example between Snowflake and Redshift, just to get faster performance. But in general, the destinations are actually the easier side of the business. And then in terms of transformations, it's really up to the user to write the SQL that transforms their data, and it is true that to write effective transformations, especially incremental transformations, you always have to use the proprietary features of the particular database that you're working on.
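A rough sketch of that shared-code-with-overrides idea, using hypothetical class names and type mappings rather than Fivetran's actual code:

```python
# Most load logic lives in a base class; destination-specific differences
# such as type names are small overrides. Names and mappings are invented.
class Destination:
    TYPE_NAMES = {"string": "VARCHAR", "int": "BIGINT", "real": "DOUBLE PRECISION"}

    def column_ddl(self, name, logical_type):
        return f"{name} {self.TYPE_NAMES[logical_type]}"

class Redshift(Destination):
    TYPE_NAMES = {**Destination.TYPE_NAMES, "real": "FLOAT8"}

class Snowflake(Destination):
    TYPE_NAMES = {**Destination.TYPE_NAMES, "real": "FLOAT"}

# A destination like BigQuery, as mentioned above, might warrant a mostly
# separate implementation rather than a small override.
```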
Tobias Macey
0:29:46
On the incremental piece, I'm interested in how you address that for some of the different source systems, because for the databases where you're doing change data capture, it's fairly obvious that you can take that approach for data loading. But for some of the more API oriented systems, I'm wondering if there's a high degree of variability in being able to pull in just the objects that have changed since the last sync time, or if there are a number of systems that will just give you absolutely everything every time, and then you have to handle that on your side.
George Fraser
0:30:20
The complexity of those change feeds, I know I mentioned this earlier, but it is staggering. But yes, on the API side, we're also doing change data capture of apps. It is different for every app, but just about every API we work with provides some kind of change feed mechanism. Now, it is complicated; you often end up in a situation where the API will give you a change feed that's incremental, but then other endpoints are not incremental. So you have to do this thing where you read the change feed, you look at the individual events in the change feed, and then you go look up the related information from the other entity. So you end up dealing with a bunch of extra complexity because of that. But as with all things at Fivetran, we have this advantage that we have many customers with each data source, so we can put in that disproportionate effort that you would never do if you were building it just for yourself, to make the change capture mechanism work properly, because we just have to do it once and then everyone who uses that data source can benefit from it.
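A hedged sketch of that incremental pattern, with a hypothetical API client, cursor format, and field names standing in for any real source:

```python
# Page through a change feed from a saved cursor and fetch related entities
# that the feed only references. All client methods and fields are invented.
def incremental_sync(client, destination, state):
    cursor = state.get("cursor")  # where the previous sync left off
    for event in client.change_feed(since=cursor):
        record = event["object"]
        # Some endpoints are not incremental, so related data has to be
        # looked up separately per event.
        if "customer_id" in record:
            record["customer"] = client.get_customer(record["customer_id"])
        destination.upsert(event["object_type"], record)
        cursor = event["cursor"]
    state["cursor"] = cursor  # persist for the next run
    return state
```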
Tobias Macey
0:31:23
For people who are getting onboarded onto the Fivetran system, I'm curious what the overall workflow looks like as far as the initial setup, and then what their workflow looks like as they're adding new sources or just interacting with their Fivetran account for being able to keep track of the overall health of their system, or if it's largely just fire and forget, and they're only interacting with the data warehouse at the other side.
George Fraser
0:31:47
It's pretty simple. The joke at Fivetran is that our demo takes about 15 seconds. Because we're so committed to automation, and we're so committed to this idea that Fivetran's fundamental job is to replicate everything into your data warehouse so that you can then do whatever you want with it, there's very little UI. The process of setting up a new data source is basically connect source, which for many sources is as simple as just going through an OAuth redirect where you click, you know, yes, Fivetran is allowed to access my data, and that's it, and connect destination, which, now that we're integrated with Snowflake and BigQuery, means you can just push a button in Snowflake or in BigQuery and create a Fivetran account that's pre-connected to your data warehouse. So the setup process is really simple. After setup, there's a bunch of UI around monitoring what's happening. We like to say that Fivetran is a glass box; it was originally a black box, and now it's a glass box. You can see exactly what it's doing. You can't change it, but you can see exactly what we're doing at all times. And, you know, part of that is in the UI, and part of that is in the emails you get when things go wrong or when the sync finishes for the first time, that kind of thing.
Tobias Macey
0:33:00
As part of that visibility, I also noticed that you will ship the transaction logs to the end user's log aggregation system, and I thought that was an interesting approach as far as giving them a way to access all of that information in one place, without having to go to your platform just for that one off case of trying to see what the transaction logs are and gain that extra piece of visibility. So I'm wondering what types of feedback you've gotten from users as far as the overall visibility into your systems and the ways that they're able to integrate it into their monitoring platforms.
George Fraser
0:33:34
Yeah, so the logs we're talking about are the logs of every action Fivetran took, like Fivetran made this API call against Salesforce, or Fivetran ran this log miner query against Oracle. So we record all this metadata about everything we're doing, and then you can see that in the UI, but you can also ship that to your own logging system, like CloudWatch or Stackdriver, because a lot of companies, in the same way that they have a centralized data warehouse, have a centralized logging system. It's mostly used by larger companies; those are the ones who invest the effort in setting up those centralized logging systems. And it's actually the system we built first, before we built it into our own UI; later, we found it's also important just to have it in our own UI as a quick way to view what's going on. And, yeah, I think people have appreciated that we're happy to support the systems they already have, rather than try to build our own thing and force you to use that.
Tobias Macey
0:34:34
I imagine that that also plays into efforts within these organizations for being able to track data lineage and provenance for understanding the overall lifecycle of their data as it spans across different systems.
George Fraser
0:34:47
You know, that's not so much of a logging problem; that's more of a metadata problem inside the data warehouse. When you're trying to track lineage, to say, like, this row in my data warehouse came from this transformation, which came from these three tables, and these tables came from Salesforce, and it was connected by this user, and it synced at this time, etc., that lineage problem is really more of a metadata problem. And that's kind of a greenfield in our area right now. There are a couple of different companies that are trying to solve that problem, and we're doing some interesting work on that in conjunction with our transformations. I think it's a very important problem, and there's still a lot of work to be done there.
Tobias Macey
0:35:28
So on the sales side of things too, I know you said that your demo is about 15 seconds, as far as, yes, you just do this and this and then your data is in your data warehouse. But I'm wondering what you have found to be some of the common questions or issues that people have that bring them to you as far as evaluating your platform for their use cases, and just some of the overall user experience design that you've put into the platform as well to help ease that onboarding process.
George Fraser
0:35:58
Yeah, so a lot of the discussions in the sales process really revolve around that ELT philosophy: Fivetran is going to take care of replicating all of your data, and then you're going to curate it non-destructively using SQL. For some people, that just seems like the obvious way to do it, but for others this is a very shocking proposition, this idea that your data warehouse is going to have this comparatively uncurated schema that Fivetran is delivering data into, and then you're basically going to make a second copy of everything. For a lot of people who've been doing this for a long time, that's a very surprising approach. And so a lot of the discussion in sales revolves around the trade-offs of that, and why we think that's the right answer for the data warehouses that exist today, which are just so much faster and so much cheaper that it makes sense to adopt that more human friendly workflow than maybe it would have in the 90s.
Tobias Macey
0:36:52
And what are the cases where Fivetran is the wrong choice for being able to replicate data or integrate it into a data warehouse?
George Fraser
0:37:00
Well, if you already have a working system, you should keep using it. We don't advise people to change things just for the sake of change. If you've set up, you know, a bunch of Python scripts that are syncing all your data sources, and it's working, keep using it. What usually happens that causes people to throw out a system is schema changes, death by a thousand schema changes. They find that the data sources upstream are changing, their scripts that are syncing their data are constantly breaking, and it's this huge effort to keep them alive. And so that's the situation where prospects will abandon an existing system and adopt Fivetran. But what I'll tell people is, you know, if your schema is not changing, if you're not having to go fix these pipelines every week, don't change it, just keep using it.
Tobias Macey
0:37:49
And as far as the overall challenges or complexities of the problem space that you're working with, I'm wondering what you have found to be some of the most difficult to overcome, or some of the ones that are most noteworthy and that you'd like to call out for anybody else who is either working in this space or considering building their own pipeline from scratch.
George Fraser
0:38:11
Yeah, you know, I think that when we got our first customer in 2015, syncing Salesforce to Redshift, and two weeks later we got our second customer syncing Salesforce and HubSpot and Stripe into Redshift, I sort of imagined that this sync problem was something we were going to have solved in a year, and then we would go on and build a bunch of other related tools. And the sync problem is much harder than it looks at first. Getting all the little details right, so that it just works, is an astonishingly difficult problem. It is a parallelizable problem: you can have lots of developers working on different data sources, figuring out all those little details, and we have accumulated general lessons that we've incorporated into our core code, so we've gotten better at doing this over the years. And it really works when you have multiple customers who have each data source, so it works a lot better as a product company than as someone building an in-house data pipeline. But the level of complexity associated with just doing replication correctly was kind of astonishing for me, and I think it is astonishing for a lot of people who try to solve this problem. You look at the API docs of a data source, and you figure, oh, I think I know how I'm going to sync this, and then you go into production with 10 customers, and suddenly you find 10 different corner cases that you never thought of that are going to make it harder than you expected to sync the data. So the level of difficulty of just that problem is kind of astonishing, but the value of solving just that problem is also kind of astonishing.
Tobias Macey
0:39:45
On both the technical and business side, I'm also interested in understanding what you have found to be the most interesting or unexpected or useful lessons that you've learned in the overall process of building and growing Fivetran.
George Fraser
0:39:59
Well, I've talked about some of the technical lessons, in terms of, you know, solving that problem really well being both really hard and really valuable. In terms of the business lessons we've learned, growing the company is a co-equal problem to growing the technology. I've been really pleased with how we've made a place where people seem to genuinely like to work, where a lot of people have been able to develop their careers in different ways. Different people have different career goals, and you need to realize, as someone leading a company, that not everyone at the company is like yourself; they have different goals that they want to accomplish. So that problem of growing the company is just as important, and just as complex, as solving the technical problems and growing the product and growing the sales side and helping people find out that you have this great product that they should probably be using. So I think that has been a real lesson for me over the last seven years that we've been doing this.
Tobias Macey
0:41:11
Now, for the future of Fivetran, what do you have planned, both on the business roadmap as well as the feature sets that you're looking to integrate into Fivetran, and just some of the overall goals that you have for the business as you look forward?
George Fraser
0:41:12
Sure. So some of the most important stuff we're doing right now is on the sales and marketing side. We have done all of this work to solve this replication problem, which is very fundamental and very reusable, and I like to say no one else should have to deal with all of these APIs. Since we have done it, you should not need to write a bunch of Python scripts to sync your data, or configure Informatica, or anything like that. We've done it once so that you don't have to, and I guarantee you it will cost you less to buy Fivetran than to have your own team basically building an in-house data pipeline. So we're doing a lot of work on the sales and marketing side just to get the word out that Fivetran is out there, and that it might be something that's really useful to you. On the product side, we are doing a lot of work now in helping people manage those transformations in the data warehouse. So we have the first version of our transformations tool in our product, and there's going to be a lot more sophistication added to that over the next year. We really view that as the next frontier for Fivetran: helping people manage the data after we've replicated it.
Tobias Macey
0:42:17
Are there any other aspects of the Fivetran company and technical stack, or the overall problem space of data synchronization, that we didn't touch on that you'd like to cover before we close out the show?
George Fraser
0:42:28
I don't think so. I think the thing that people tend to not realize, because they tend to just not talk about it as much, is that the real difficulty in this space is all of that incidental complexity of all the data sources. You know, Kafka is not going to solve this problem for you. Spark is not going to solve this problem for you. There is no fancy technical solution. Most of the difficulty of the data centralization problem is just in understanding and working around all of the incidental complexity of all these data sources.
Tobias Macey
0:42:58
For anybody who wants to get in touch with you or follow along with the work that you and Fivetran are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
George Fraser
0:43:15
Yeah, I think that the biggest gap right now is in the tools that are available to analysts who are trying to curate the data after it arrives. Writing all the SQL that curates the data into a format that's ready for the business users to attack with BI tools is a huge amount of work, and it remains a huge amount of work. And if you look at the workflow of the typical analyst, they're writing a ton of SQL. It's a very analogous problem to a developer writing code using Java or C#, but the tools that analysts have to work with look like the tools developers had in, like, the 80s. I mean, they don't even really have autocomplete. So I think that is a really under-invested-in problem: the tooling for analysts, to make them more productive in the exact same way as we've been building tooling for developers over the last 30 years. A lot of that needs to happen for analysts too, and I think it hasn't happened yet.
Tobias Macey
0:44:13
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Fivetran and some of the insights that you've gained in the process. It's definitely an interesting platform and an interesting problem space, and I can see that you're providing a lot of value. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
George Fraser
0:44:31
Thanks for having me on.

Data Labeling That You Can Feel Good About - Episode 89

Summary

Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Sears about Cloud Factory, masters of the art and science of labeling data for Machine Learning and more

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what CloudFactory is and the story behind it?
  • What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
  • What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
  • Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
    • What protocols do you have in place to ensure data quality and identify potential sources of bias?
  • What role do humans play in the lifecycle for AI and ML projects?
  • I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
    • How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
  • Can you share some stories of cloud workers who have benefited from their experience working with your company?
  • What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
  • What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
  • What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
  • What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
    • How does that tie into your plans for CloudFactory in the medium to long term?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Evolving An ETL Pipeline For Better Productivity - Episode 83

Summary

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components to a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
  • Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
    • What are your primary sources of data and what are the targets that you are loading them into?
  • What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
    • What were your criteria for your replacement technology and how did you gather and evaluate your options?
  • Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
    • What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
    • What were the big wins?
  • What was your evaluation framework for determining whether your re-engineering was successful?
  • Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
    • If you have freed up time for your engineers, how are you allocating that spare capacity?
  • What do you hope to see from DataCoral in the future?
  • What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Lineage For Your Pipelines - Episode 82

Summary

Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and gives examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that, he also shares his thoughts on their recent round of fundraising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first-class tracking of data lineage, and language-agnostic data pipelines.
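
As a rough point of reference for the pipeline workflow discussed in the interview, here is a small sketch of what a Pachyderm pipeline specification can look like, expressed as a Python dict and written out as JSON. The repository name, container image, and command are invented for illustration rather than taken from the episode.

    # A rough sketch of a Pachyderm pipeline spec expressed as a Python dict
    # and written out as JSON; the repo name, image, and command below are
    # illustrative placeholders.
    import json

    pipeline_spec = {
        "pipeline": {"name": "edges"},
        "transform": {
            # Hypothetical container image and entrypoint.
            "image": "example/edge-detector:latest",
            # Pachyderm mounts each input repo under /pfs/<repo> and expects
            # results under /pfs/out, which becomes the pipeline's output repo.
            "cmd": ["python3", "/app/edges.py", "/pfs/images", "/pfs/out"],
        },
        # Process each top-level file in the `images` repo independently.
        "input": {"pfs": {"repo": "images", "glob": "/*"}},
    }

    with open("edges.json", "w") as f:
        json.dump(pipeline_spec, f, indent=2)

A spec like this would typically be submitted with pachctl create pipeline -f edges.json; new commits to the images repo then trigger the pipeline, and the results retain provenance back to the input commits that produced them.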

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pachyderm is and how it got started?
    • What is new in the last two years since I talked to Dan Whitenack in episode 1?
    • How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
  • A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
  • Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
    • How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
  • There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
  • Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
    • What is the interface for exposing and exploring that provenance data?
  • What are some of the advanced capabilities of Pachyderm that you would like to call out?
  • With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
  • What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
  • What do you have planned for the future of Pachyderm?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Customer Analytics At Scale With Segment - Episode 72

Summary

Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them, you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs, Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.
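
To make the “single interface” idea concrete, here is a minimal sketch using Segment’s analytics-python client. The write key, user id, and event names are placeholders rather than anything from the episode; the point is that the instrumentation is written once and Segment fans the events out to whichever destinations are enabled.

    # A minimal sketch using Segment's analytics-python client; the write key,
    # user id, and event names are placeholders.
    import analytics

    analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

    # Describe the user once...
    analytics.identify("user-123", {
        "email": "jane@example.com",
        "plan": "enterprise",
    })

    # ...then emit events from anywhere in the application. Segment routes
    # these to the downstream destinations that are enabled, so the
    # instrumentation does not change when a new tool is added.
    analytics.track("user-123", "Report Exported", {
        "format": "csv",
        "row_count": 2500,
    })

    analytics.flush()  # send any queued events before the process exits

Adding or removing a downstream analytics tool then becomes a configuration change in Segment rather than a change to application code.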

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston you still have time to grab a ticket to the Enterprise Data World, starting on May 17th, and to the Open Data Science Conference, which runs from April 30th to May 3rd. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Segment is and how the business got started?
    • What are some of the primary ways that your customers are using the Segment platform?
    • How have the capabilities and use cases of the Segment platform changed since it was first launched?
  • Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the overall structure of Segment and the driving force behind their design and use?
  • What are some of the best practices for structuring custom events in a way that they can be easily integrated with downstream platforms?
    • How do you manage changes or errors in the events generated by the various sources that you support?
  • How is the Segment platform architected and how has that architecture evolved over the past few years?
  • What are some of the unique challenges that you face as a result of being a many-to-many event routing platform?
  • In addition to the various services that you integrate with for data delivery, you also support populating of data warehouses. What is involved in establishing and maintaining the schema and transformations for a customer?
  • What have been some of the most interesting, unexpected, and/or challenging lessons that you have learned while building and growing the technical and business aspects of Segment?
  • What are some of the features and improvements, both technical and business, that you have planned for the future?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building Enterprise Big Data Systems At LEGO - Episode 66

Summary

Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance requirements and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started?
    • What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data?
    • What was the transition process like, migrating data silos into a uniformly managed platform?
  • What are the biggest data challenges that you face at LEGO?
  • What are some of the most critical sources and types of data that you are managing?
  • What are the main components of the data infrastructure that you have built to support the organization’s analytical needs?
    • What are some of the technologies that you have found to be most useful?
    • Which have been the most problematic?
  • What does the team structure look like for the data services at LEGO?
    • Does that reflect in the types/numbers of systems that you support?
  • What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support?
  • What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO?
  • How have the data systems at LEGO evolved over recent years as new technologies and techniques have been developed?
  • How does the global nature of the LEGO business influence the design strategies and technology choices for your platform?
  • What are you most excited for in the coming year?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time-oriented events.
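
As a minimal sketch of what timeseries work on plain SQL looks like in practice, the snippet below creates a hypertable and runs a time-bucketed aggregate from Python. The connection string, table, and column names are placeholders; create_hypertable and time_bucket are TimescaleDB SQL functions.

    # A minimal sketch of timeseries work in plain SQL from Python: create a
    # hypertable and run a time-bucketed aggregate. The DSN, table, and column
    # names are placeholders.
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@localhost:5432/metrics")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS conditions (
                time        TIMESTAMPTZ NOT NULL,
                device_id   TEXT        NOT NULL,
                temperature DOUBLE PRECISION
            );
        """)
        # Turn the regular PostgreSQL table into a hypertable partitioned on time.
        cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")
        # Ordinary SQL still applies; time_bucket groups rows into 5 minute windows.
        cur.execute("""
            SELECT time_bucket('5 minutes', time) AS bucket,
                   device_id,
                   avg(temperature) AS avg_temp
            FROM conditions
            GROUP BY bucket, device_id
            ORDER BY bucket;
        """)
        for row in cur.fetchall():
            print(row)

Because a hypertable is still a regular PostgreSQL table from the application’s point of view, existing drivers and SQL tooling continue to work unchanged.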

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you refresh our memory about what TimescaleDB is?
  • How has the market for timeseries databases changed since we last spoke?
  • What has changed in the focus and features of the TimescaleDB project and company?
  • Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
    • What were the most challenging aspects of reaching that goal?
  • In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
    • How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
  • What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
  • How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
    • Have you been able to leverage some of the native improvements to simplify your implementation?
    • Are there any use cases for Timescale that would previously have been impractical in vanilla PostgreSQL but would now be reasonable without the help of Timescale?
  • What is in store for the future of the Timescale product and organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

Summary

Every business needs a pipeline for their critical data, even if it is just copying and pasting data into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the outputs of the pipelines are used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
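
For a sense of the “just getting started” end of that spectrum, here is a deliberately small sketch of an early-stage pipeline: extract rows from a CSV export, apply one cleanup transformation, and load the results into a local SQLite database. The file name, columns, and schema are invented for illustration and are not from the episode.

    # A deliberately small sketch of an early-stage pipeline: extract rows from
    # a CSV export, apply one cleanup transformation, load into SQLite. The file
    # name, columns, and schema are invented for illustration.
    import csv
    import sqlite3

    def extract(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        for row in rows:
            # Normalize types and drop records that fail basic validation.
            try:
                amount = float(row["order_total"])
            except (KeyError, ValueError):
                continue
            yield (row["order_id"], row["placed_at"], amount)

    def load(records, db_path="orders.db"):
        conn = sqlite3.connect(db_path)
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, placed_at TEXT, amount REAL)"
            )
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
        conn.close()

    if __name__ == "__main__":
        load(transform(extract("orders_export.csv")))

The discussion is largely about what changes once a script like this stops being enough: evolving requirements, minimizing the impact on source systems, and the build vs. buy decision for each component.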

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of a data pipeline?
    • At what point in the life of a project or organization should you start thinking about building a pipeline?
  • In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
    • What metrics/use cases should you be optimizing for at this point?
  • What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
    • How do the design requirements for a data pipeline change as you reach this stage?
    • What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
  • What are some of the changes that are necessary as you move to a large scale data pipeline?
  • At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
  • In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
  • When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
  • Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
  • What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
    • How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
  • What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
  • What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
  • What are your plans for improving your current pipeline at Grubhub?
  • What are some references that you recommend for anyone who is designing a new data platform?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA