ETL

Digging Into Data Replication At Fivetran - Episode 93

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing George Fraser about Fivetran, a hosted platform for replicating your data from source to destination

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that Fivetran solves and the story of how it got started?
  • Integration of multiple data sources (e.g. entity resolution)
  • How is Fivetran architected and how has the overall system design changed since you first began working on it?
  • monitoring and alerting
  • Automated schema normalization. How does it work for customized data sources?
  • Managing schema drift while avoiding data loss
  • Change data capture
  • What have you found to be the most complex or challenging data sources to work with reliably?
  • Workflow for users getting started with Fivetran
  • When is Fivetran the wrong choice for collecting and analyzing your data?
  • What have you found to be the most challenging aspects of working in the space of data integrations?
  • What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
  • What do you have planned for the future of Fivetran?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing George Fraser about Fivetran, a platform for shipping your data to data warehouses in a managed fashion. So George, can you start by introducing yourself?
George Fraser
0:01:54
Yeah, my name is George. I am the CEO of Fivetran, and I was one of two co-founders of Fivetran almost seven years ago when we started.
Tobias Macey
0:02:02
And do you remember how you first got involved in the area of data management?
George Fraser
0:02:05
Well, before Fivetran, I was actually a scientist, which is a bit of an unusual background for someone in data management, although it was sort of an advantage for us that we were coming at it fresh. So much has changed in the area of data management, particularly because of the new data warehouses that are so much faster and so much cheaper and so much easier to manage than the previous generation, that a fresh approach is really merited. And so in a weird way, the fact that none of the founding team had a background in data management was kind of an advantage.
Tobias Macey
0:02:38
And so can you start by describing a bit about the problem that Fivetran was built to solve, the overall story of how it got started, and what motivated you to build a company around it?
George Fraser
0:02:50
Well, I'll start with the story of how it got started. So in late 2012 we started the company, Taylor and I, and then Mel, who's now our VP of engineering, joined early in 2013. Fivetran was originally a vertically integrated data analysis tool. So it had a user interface that was sort of a super-powered spreadsheet slash BI tool, it had a data warehouse on the inside, and it had a data pipeline that was feeding the data warehouse. And through many iterations of that idea, we discovered that the really valuable thing we had invented was actually the data pipeline that was part of that. And so we threw everything else away, and the data pipeline became the product. And the problem that Fivetran solves is the problem of getting all your company's data in one place. So companies today use all kinds of tools to manage their business. You use CRM systems like Salesforce, you use payment systems like Stripe, support systems like Zendesk, finance systems like QuickBooks or Zuora, you have a production database somewhere, maybe you have 20 production databases. And if you want to know what is happening in your business, the first step is usually to synchronize all of this data into a single database, where an analyst can query it, and where you can build dashboards and BI tools on top of it. So that's the primary problem that Fivetran solves. People use Fivetran to do other things. Sometimes they use the data warehouse that we're syncing to as a production system. But the most common use case is they're just trying to understand what's going on in their business, and the first step in that is to sync all of that data into a single database.
Tobias Macey
0:04:38
And in recent years, one of the prevalent approaches for being able to get all the data into one location for being able to do analysis across it is to dump it all into a data lake because of the fact that you don't need to do as much upfront schema management or data cleaning. And then you can experiment with everything that's available. And I'm wondering what your experience has been as far as the contrast between loading everything into a data warehouse for that purpose versus just using a data lake.
George Fraser
0:05:07
Yeah. So in this area, I think that sometimes people present a bit of a false choice between: you can either set up a data warehouse, do full-on Kimball dimensional schema data modeling and Informatica, with all of the upsides and downsides that come with that, or you can build a data lake, which is like a bunch of JSON and CSV files in S3. And I say false choice, because I think the right approach is a happy medium, where you don't go all the way to sticking raw JSON files and CSV files in S3, that's really unnecessary. Instead, you use a proper relational data store, but you exercise restraint in how much normalization and customization you do on the way in. So you say, I'm going to make my first goal to create an accurate replica of all the systems in one database, and then I'm going to leave that alone. That's going to be my sort of staging area, kind of like my data lake, except it lives in a regular relational data warehouse. And then I'm going to build whatever transformations I want to do on that data on top of that data lake schema. So another way of thinking about it is that I am advising that you should take a data lake type approach, but you shouldn't make your data lake a separate physical system. Instead, your data lake should just be a different logical system within the same database that you're using to analyze all your data and to support your BI tool. It's just a higher productivity, simpler workflow to do it that way.
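As a rough sketch of the workflow George describes (table names invented, SQLite standing in for a real cloud warehouse), the pipeline lands an untouched replica in a staging area and the curated model is just SQL layered on top of it in the same database:

    # Minimal sketch of "ELT in one warehouse": keep an untouched replica
    # (tables prefixed with "staging_") and build the curated model as views
    # on top of it. SQLite keeps the example self-contained; a real warehouse
    # would use separate schemas instead of name prefixes.
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # 1. The pipeline's job: land an accurate, untransformed replica.
    conn.executescript("""
        CREATE TABLE staging_salesforce_account (id TEXT, name TEXT, created_at TEXT);
        INSERT INTO staging_salesforce_account VALUES ('001', 'Acme Corp', '2019-01-05');

        CREATE TABLE staging_stripe_customer (id TEXT, email TEXT, account_ref TEXT);
        INSERT INTO staging_stripe_customer VALUES ('cus_1', 'buyer@acme.com', '001');
    """)

    # 2. The analyst's job: curate non-destructively with SQL, leaving staging alone.
    conn.executescript("""
        CREATE VIEW dim_customer AS
        SELECT a.id    AS account_id,
               a.name  AS account_name,
               c.email AS billing_email
        FROM staging_salesforce_account a
        LEFT JOIN staging_stripe_customer c ON c.account_ref = a.id;
    """)

    print(conn.execute("SELECT * FROM dim_customer").fetchall())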
Tobias Macey
0:06:47
Yeah. And that's where the current trend of moving the transformation step until after the data loading, into the ELT pattern, has been coming from, because of the flexibility of these cloud data warehouses that you've mentioned, as far as being able to consume semi-structured and unstructured data while still being able to query across it and introspect it for the purposes of being able to join with other information that's already within that system.
George Fraser
0:07:11
Yeah, the ELT pattern is really just a great way to get work done. It's simple. It allows you to recover from mistakes. So if you make a mistake in your transformations, and you will make mistakes in your transformations, or even if you just change your mind about how you want to transform the data, the great advantage of the ELT pattern is that the original untransformed data is still sitting there side by side in the same database. So it's just really easy to iterate in a way that it isn't if you're transforming the data on the fly, or even if you have a data lake where you store the API responses from all of your systems. That's still more complicated than if you just have this nice replica sitting in its own schema in your data warehouse.
Tobias Macey
0:07:58
And so one of the things that you pointed out is needing to be able to integrate across multiple different data sources that you might be using within a business. And you mentioned things like Salesforce for CRM, or things like ticket tracking and user feedback, such as Zendesk, etc. And I'm wondering what your experience has been as far as being able to map the sort of logical entities across these different systems together, to be able to effectively join and query across those data sets, given that they don't necessarily have a shared sense of truth for things like how customers are represented, or even what the common field names might be, to be able to map across those different entities.
George Fraser
0:08:42
Yeah, this is a really important step, and the first thing we always advise our customers to do. And even anyone who's building a data warehouse, I would advise to do this: you need to keep straight in your mind that there's really two problems here. The first problem is replicating all of the data, and the second problem is rationalizing all the data into a single schema. And you need to think of these as two steps, you need to follow proper separation of concerns, just as you would in a software engineering project. So we really focus on that first step, on replication. What we have found is that the approach that works really well for our customers for rationalizing all the data into a single schema is to use SQL. SQL is a great tool for unioning things, joining things, changing field names, filtering data, all the kind of stuff you need to do to rationalize a bunch of different data sources into a single schema. We find the most productive way to do that is to use a bunch of SQL queries that run inside your data warehouse.
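A minimal illustration of that rationalization step, with invented source tables: two systems describe the same concept under different field names, and a single SQL view unions and renames them into one schema (SQLite is used here only so the sketch runs end to end):

    # Two "sources" expose the same concept, a support ticket, with different
    # field names; a UNION ALL view maps both onto one unified schema.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE zendesk_ticket (id TEXT, subject TEXT, requester_email TEXT);
        CREATE TABLE jira_issue (key TEXT, summary TEXT, reporter TEXT);

        INSERT INTO zendesk_ticket VALUES ('42', 'Login broken', 'ann@example.com');
        INSERT INTO jira_issue VALUES ('OPS-7', 'Nightly sync failed', 'bob@example.com');

        -- One schema for analysts, regardless of where the ticket came from.
        CREATE VIEW unified_ticket AS
        SELECT id AS ticket_id, subject AS title, requester_email AS reporter, 'zendesk' AS source
        FROM zendesk_ticket
        UNION ALL
        SELECT key AS ticket_id, summary AS title, reporter AS reporter, 'jira' AS source
        FROM jira_issue;
    """)

    for row in db.execute("SELECT * FROM unified_ticket ORDER BY ticket_id"):
        print(row)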
Tobias Macey
0:09:44
And do you have your own tooling and interfaces for being able to expose that process to your end users? Or do you also integrate with tools such as dbt, for being able to have that overall process controlled by the end user?
George Fraser
0:10:00
So we originally did not do anything in this area other than give advice, and we got the advantage that we got to sort of watch what our users did in that context. And what we saw is that a lot of them set up cron to run SQL scripts on a regular schedule. A lot of them used Looker persistent derived tables. Some people used Airflow, and they used Airflow in kind of a funny way: they didn't really use the Python parts of Airflow, they just used Airflow as a way to trigger SQL. And when dbt came out, we got a decent community of users who use dbt. And we're supportive of whatever mechanism you want to use to transform your data. We do now have our own transformation tool built into our UI, and it's the first version that you can use right now. It's basically a way that you can provide a SQL script, and you can trigger that SQL script when Fivetran delivers new data to your tables. And we've got lots of people using the first version of that. That's going to continue to evolve over the rest of this year, it's going to get a lot more sophistication, and it's going to do a lot more to give you insight into the transforms that are running and how they all relate to each other. But the core idea of it is that SQL is the right tool for transforming data.
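A hedged sketch of that trigger-SQL-on-new-data pattern; the watermark lookup and SQL runner here are placeholders for whatever your warehouse client provides, and tools like Fivetran's built-in transformations or dbt handle this far more robustly:

    # Run a SQL transformation whenever the pipeline reports a newer load.
    # "fetch_last_load_time" and "run_sql" are stand-ins, not a real API.
    import time

    def make_trigger(fetch_last_load_time, run_sql, transform_sql):
        last_seen = 0.0
        def check():
            nonlocal last_seen
            loaded_at = fetch_last_load_time()
            if loaded_at > last_seen:      # new data arrived since the last run
                run_sql(transform_sql)
                last_seen = loaded_at
        return check

    # Trivial stubs so the sketch runs end to end.
    loads = [time.time()]
    check = make_trigger(
        fetch_last_load_time=lambda: max(loads),
        run_sql=lambda sql: print("running:", sql.strip()),
        transform_sql="INSERT INTO daily_revenue SELECT date, SUM(amount) FROM staging_orders GROUP BY date",
    )

    check()   # fires: a load has happened
    check()   # no-op: nothing new since the last run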
Tobias Macey
0:11:19
And before we get too far into the rest of the feature set and capabilities of Fivetran, I'm wondering if you can talk about how the overall system is architected, and how the overall system design has evolved since you first began working on it.
George Fraser
0:11:33
Yeah, so the overall architecture is fairly simple. The hard part of Fivetran is really not the sort of high-class data problems, things like queues and streams and giant data sets flying around. The hard part of Fivetran is really all of the incidental complexity of all of these data sources, understanding all the small, sort of crazy rules that every API has. So most of our effort over the years has actually been devoted to hunting down all these little details of every single data source we support, and that's what makes our product really valuable. The architecture itself is fairly simple. The original architecture was essentially a bunch of EC2 instances with cron, running a bunch of Java processes on a fast batch cycle, syncing people's data. Over the last year and a half, the engineering team has built a new architecture based on Kubernetes. There are many advantages of this new architecture for us internally, the biggest one being that it auto-scales. But from the outside, you can't even tell when you migrate from the old architecture to the new architecture, other than you have to whitelist a new set of IPs. So, you know, it was a very simple architecture in the beginning, and it's gotten somewhat more complex. But really, the hard part of Fivetran is not the high-class data engineering problems. It's the little details of every data source, so that from the user's perspective, you just get this magical replica of all of your systems in a single database.
Tobias Macey
0:13:16
And for being able to keep track of the overall health of your system and ensure that data is flowing from end to end for all of your different customers. I'm curious what you're using for monitoring and alerting strategy and any sort of ongoing continuous testing, as well as advanced unit testing that you're using to make sure that all of your API interactions are consistent with what is necessary for the source systems that you're working with?
George Fraser
0:13:42
Yeah, well, first of all, there's several layers to that. The first one is actually the testing that we do on our end to validate that all of our sync strategies, all those little details I mentioned a minute ago, are actually working correctly. Our testing problem is quite difficult, because we interoperate with so many external systems, and in many cases, you really have to run the tests against the real system for the test to be meaningful. And so our build architecture is actually one of the more complex parts of Fivetran. We use a build tool called Bazel, and we've done a lot of work, for example, to run all of the databases and FTP servers and things like that that we have to interact with in Docker containers, so that we can actually produce reproducible end-to-end tests. So that actually is one of the more complex engineering problems at Fivetran, and if that sounds interesting to you, I encourage you to apply to our engineering team, because we have lots more work to do on that. So that's the first layer, really all of those tests that we run to verify that our sync strategies are correct. The second layer is, you know, is it working in production, is the customer's data actually getting synced, and is it getting synced correctly. And one of the things we do there that may be a little unexpected to people who are accustomed to building data pipelines themselves is that all Fivetran data pipelines are typically fail-fast. That means if anything unexpected happens, if we see, you know, some event from an API endpoint that we don't recognize, we stop. Now, that's different than when you build data pipelines yourself. When you build data pipelines for your own company, usually you will have them try to keep going no matter what. But Fivetran is a fully managed service, and we're monitoring it all the time. So we tend to make the opposite choice: if anything suspicious is going on, the correct thing to do is just stop and alert Fivetran, hey, go check out this customer's data pipeline, what the heck is going on? Something unexpected is happening, and we should make sure that our sync strategies are actually correct. And then that brings us to the last layer of this, which is alerting. So when data pipelines fail, we get alerted and the customer gets alerted at the same time. And then we communicate with the customer, and we say, hey, we may need to go in and check something, do I have permission to go, you know, look at what's going on in your data pipeline in order to figure out what's going wrong? Because Fivetran is a fully managed service, and that is critical to making it work. When you do what we do, and you say we are going to take responsibility for actually creating an accurate replica of all of your systems in your data warehouse, that means you're signing on to comprehend and fix every little detail of every data source that you support. And a lot of those little details only come up in production, when some customer shows up and they're using a feature of Salesforce that Salesforce hasn't sold for five years, but they've still got it, and you've never seen it before. A lot of those little things only come up in production. The nice thing is that that set of little things, while it is very large, it is finite, and we only have to discover each problem once, and then every customer thereafter benefits from that.
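A toy illustration of the fail-fast posture George describes, with an invented event shape: instead of silently skipping records it does not understand, the sync raises so a human gets alerted:

    # Stop the sync on anything unrecognized, rather than guessing and
    # continuing. Field names and the event structure are made up.
    EXPECTED_FIELDS = {"id", "type", "updated_at"}

    class UnexpectedSourceData(Exception):
        """Raised so the pipeline halts and alerts instead of guessing."""

    def validate_event(event: dict) -> dict:
        unknown = set(event) - EXPECTED_FIELDS
        if unknown:
            raise UnexpectedSourceData(
                f"unrecognized fields {sorted(unknown)}: sync halted for review")
        return event

    print(validate_event({"id": 1, "type": "contact", "updated_at": "2019-06-01"}))

    try:
        validate_event({"id": 2, "type": "contact", "updated_at": "2019-06-01",
                        "shadow_field": True})
    except UnexpectedSourceData as err:
        print("alerting on-call:", err)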
Tobias Macey
0:17:00
Thanks. And for the system itself, one of the things that I noticed while I was reading through the documentation and the feature set is that for all of these different source systems, you provide automated schema normalization. And I'm curious how that works, and the overall logic flow that you have built in, if it's just a static mapping that you have for each different data source, or if there's some sort of more complex algorithm that's going on behind the scenes there, as well as how that works for any sort of customized data sources, such as application databases that you're working with, or maybe just JSON feeds or event streams?
George Fraser
0:17:38
Sure. So the first thing you have to understand is that there's really two categories of data sources in terms of schema normalization. The first category is databases, like Oracle or MySQL or Postgres, and database-like systems. NetSuite is really basically a database when you look at the API, so is Salesforce. There's a bunch of systems that basically look like databases: they have arbitrary tables and columns, and you can set any types you want in any column. What we do with those systems is we just create an exact one-to-one replica of the source schema. It's really as simple as that. So there's a lot of work to do, because the change feeds are usually very complicated from those systems, and it's very complex to turn those change feeds back into the original schema, but it is automated. So for databases and database-like systems, we just produce the exact same schema in your data warehouse as it was in the source. For apps, things like Stripe or Zendesk or GitHub or Jira, we do a lot of normalization of the data. For tools like that, when you look at the API responses, the API responses are very complex and nested, and usually very far from the original normalized schema that this data probably lived in, in the source database. And every time we add a new data source of that type, we study the data source. I joke that we reverse engineer the API: we basically figure out what was the schema in the database that this originally was, and we unwind all the API responses back into that normalized schema. These days, we often just get an engineer at the company that is that data source on the phone and ask them, you know, what is the real schema here? We found that we can save ourselves a whole lot of work by doing that. But the goal is always to produce a normalized schema in the data warehouse. And the reason why we do that is because we just think, if we put in that work up front to normalize the data in your data warehouse, we can save every single one of our customers a whole bunch of time traipsing through the data, trying to figure out how to normalize it. So we figure it's worthwhile for us to put the effort in up front so our customers don't have to.
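A simplified sketch of that unwinding step, using an invented nested payload rather than any real vendor's API: one response becomes a parent row and child rows with a foreign key:

    # Unwind a nested API response back into a normalized schema:
    # nested object -> foreign key, nested array -> child table rows.
    api_response = {
        "id": "in_123",
        "customer": {"id": "cus_1", "email": "buyer@example.com"},
        "lines": [
            {"id": "li_1", "amount": 500, "description": "Starter plan"},
            {"id": "li_2", "amount": 120, "description": "Overage"},
        ],
    }

    def normalize_invoice(payload):
        invoice_row = {
            "id": payload["id"],
            "customer_id": payload["customer"]["id"],   # nested object -> foreign key
        }
        line_rows = [
            {"id": line["id"], "invoice_id": payload["id"],
             "amount": line["amount"], "description": line["description"]}
            for line in payload["lines"]                # nested array -> child table
        ]
        return invoice_row, line_rows

    invoice, lines = normalize_invoice(api_response)
    print(invoice)
    print(lines)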
Tobias Macey
0:20:00
One of the other issues that comes up with normalization, and particularly for the source database systems that you're talking about, is the idea of schema drift, when new fields are added or removed, or data types change, or the default data types change. And I'm wondering how you manage schema drift overall in the data warehouse systems that you're loading into, while preventing data loss, particularly in the cases where a column might be dropped or the data type changed.
George Fraser
0:20:29
Yeah, so there's a core pipeline that all Fivetran connectors, databases, apps, everything, are written against, that we use internally, and all of the rules of how to deal with schema drift are encoded there. So some cases are easy. Like, if you drop a column, then that data just isn't arriving anymore. We will leave that column in your data warehouse, we're not going to delete it in case there's something important in it; you can drop it in your data warehouse if you want to, but we're not going to. If you add a column, again, that's pretty easy. We add a column in your data warehouse, all of the old rows will have nulls in that column, obviously, but then going forward, we will populate that column. The tricky cases are when you change the types. So when you alter the type of an existing column, that can be more difficult to deal with. There's two principles we follow. First of all, we're going to propagate that type change to your data warehouse, so we're going to go and change the type of the column in your data warehouse to fit the new data. And the second principle we follow is that when you change types, sometimes you sort of contradict yourself, and we follow the rules of subtyping in handling that. If you think back to your undergraduate computer science classes, this is the good old concept of subtypes: for example, an int is a subtype of a real, a real is a subtype of a string, etc. So we look at all the data passing through the system, and we infer what is the most specific type that can contain all of the values that we have seen. And then we alter the data warehouse to be that type, so that we can actually fit the data into the data warehouse.
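A small sketch of that subtyping rule, assuming a toy type ladder of int, real, string; the real pipeline handles far more types, but the widening idea is the same:

    # Pick the most specific type that can hold every value seen for a column,
    # widening along int -> real -> string as contradictions appear.
    def infer_column_type(values):
        def type_of(v):
            try:
                int(v)
                return "int"
            except (TypeError, ValueError):
                pass
            try:
                float(v)
                return "real"
            except (TypeError, ValueError):
                return "string"

        order = ["int", "real", "string"]   # widening order
        widest = "int"
        for v in values:
            t = type_of(v)
            if order.index(t) > order.index(widest):
                widest = t
        return widest

    print(infer_column_type(["1", "2", "3"]))        # int
    print(infer_column_type(["1", "2.5"]))           # real: column gets altered to real
    print(infer_column_type(["1", "2.5", "n/a"]))    # string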
Tobias Macey
0:22:17
Another capability that you provide is Change Data Capture for when you're loading from these relational database systems into the data warehouse. And that's a problem space that I've always been interested in as far as how you're able to capture the change logs within the data system, and then be able to replay them effectively to reconstruct the current state of the database without just doing a straight SQL dump. And I'm wondering how you handle that in your platform?
George Fraser
0:22:46
Yeah, it's very complicated. Most people who build in-house data pipelines, as you say, just do a dump and load of the entire table, because the change logs are so complicated. And the problem with dump and load is that it requires huge bandwidth, which isn't always available, and it takes a long time. So you end up running it just once an hour if you're lucky, but for a lot of people, once a day. So we do change data capture: we read the change logs of each database. Each database has a different change log format, and most of them are extremely complicated. If you look at the MySQL change log format, or the Oracle change log format, it is like going back in time through the history of MySQL; you can sort of see every architectural change in MySQL in the change log format. The answer to how we do that: there's no trick, it's just a lot of work understanding all the possible corner cases of these change logs. It helps that we have many customers with each database. So unlike when you're building a system just for yourself, because we're building a product, we have lots of MySQL users, we have lots of Postgres users, and so over time we see all the little corner cases, and you eventually figure it out, you eventually find all the things and you get a system that just works. But the short answer is there's really no trick. It's just a huge amount of effort by the databases team at Fivetran, who at this point has been working on it for years with, you know, hundreds of customers. So at this point, we've invested so much effort in tracking down all those little things, there's just no hope that you could do better yourself, building a change log reader just for your own company.
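A heavily simplified sketch of what a change-log consumer ultimately does once the format has been decoded: replay insert, update, and delete events onto a replica keyed by primary key. Real formats (MySQL binlog, Postgres WAL, Oracle LogMiner) are far messier than this invented event list:

    # Replay decoded change events onto a replica keyed by primary key.
    change_log = [
        {"op": "insert", "pk": 1, "row": {"id": 1, "email": "a@example.com"}},
        {"op": "update", "pk": 1, "row": {"id": 1, "email": "a+new@example.com"}},
        {"op": "insert", "pk": 2, "row": {"id": 2, "email": "b@example.com"}},
        {"op": "delete", "pk": 2, "row": None},
    ]

    def apply_changes(replica: dict, events):
        for event in events:
            if event["op"] == "delete":
                replica.pop(event["pk"], None)
            else:                       # insert and update are both upserts
                replica[event["pk"]] = event["row"]
        return replica

    replica = {}
    print(apply_changes(replica, change_log))
    # {1: {'id': 1, 'email': 'a+new@example.com'}}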
Tobias Macey
0:24:28
For the particular problem space that you're in, you have a sort of many-to-many issue, where you're dealing with a lot of different types of data sources, and then you're loading them into a number of different options for data warehouses. And on the source side, I'm wondering what you have found to be some of the most complex or challenging sources to be able to work with reliably, and some of the strategies that you have found to be effective for picking up a new source and being able to get it production ready in the shortest amount of time.
George Fraser
0:24:57
Yeah, it's funny, you know, if you ask any engineer at Fivetran, they can all tell you what the most difficult data sources are, because we've had to do so much work on them over the years. Undoubtedly, the most difficult data source is Marketo; close seconds are Jira, Asana, and then probably NetSuite. Those APIs just have a ton of incidental complexity, and it's really hard to get data out of them fast. We're working with some of these sources to try to help them improve their APIs to make it easier to do replication, but there's a handful of data sources that have required disproportionate work to get them working reliably. In general, one funny observation that we have seen over the years is that the companies with the best APIs tend, unfortunately, to be the least successful companies. It seems to be a general principle that companies which have really beautiful, well-organized APIs tend to not be very successful businesses, I guess because they're just not focused enough on sales or something. We've seen it time and again, where we integrate a new data source, and we look at the API and we go, man, this API is great, I wish you had more customers so that we could sync for them. The one exception, I would say, is Stripe, which has a great API and is a highly successful company, and that's probably because their API is their product. So there's definitely a spectrum of difficulty. In general, the oldest, largest companies have the most complex APIs.
Tobias Macey
0:26:32
I wonder if there's some reverse incentive where they make their APIs obtuse and difficult to work with, so that they can build up an ecosystem around them of contractors whose sole purpose is to be able to integrate them with other systems.
George Fraser
0:26:46
You know, I think there's a little bit of that, but less than you would think. For example, the company that has by far the most extensive ecosystem of contractors helping people integrate their tool with other systems is Salesforce. And Salesforce's API is quite good. Salesforce is actually one of the simpler APIs out there. It was harder a few years ago when we first implemented it, but they made a lot of improvements, and it's actually one of the better APIs now.
Tobias Macey
0:27:15
Yeah, I think that's probably coming off the tail of their acquisition of MuleSoft to sort of reformat their internal systems and data representation to make it easier to integrate. Because I know beforehand, it was just a whole mess of XML.
George Fraser
0:27:27
You know, it was really before the MuleSoft acquisition that a lot of the improvements in the Salesforce API happened. The Salesforce REST API was pretty well structured and rational five years ago, but it would fail a lot: you would send queries and they would just not return when you had really big data sets, and now it's more performant. So I think it predates the MuleSoft acquisition. They just did the hard work to make all the corner cases work reliably and scale to large data sets, and Salesforce is now one of the easier data sources to actually sync. There are certain objects that have complicated rules, and I think the developers at Fivetran who work on Salesforce will get mad at me when they hear me say this, but compared to, like, NetSuite, it's pretty great.
Tobias Macey
0:28:12
On the other side of the equation, where you're loading data into the different target data warehouses, I'm wondering what your strategy is as far as being able to make the most effective use of the feature sets that are present, or do you just target the lowest common denominator of SQL representation for being able to load data in, and then leave the complicated aspects of it to the end user for doing the transformations and analyses?
George Fraser
0:28:36
So most of the code for doing the load side is shared between the data warehouses. The differences are not that great between different destinations, except BigQuery. BigQuery is a little bit of an unusual creature, so if you look at Fivetran's code base, there's actually a different implementation for BigQuery that shares very little with all of the other destinations. So the differences between destinations are not that big of a problem for us. There are certain things that do differ: there are functions that have to be overridden for different destinations, for things like the names of types, and there's some special cases around performance where our load strategies are slightly different, for example between Snowflake and Redshift, just to get faster performance. But in general, that actually is the easier side of the business, the destinations. And then in terms of transformations, it's really up to the user to write the SQL that transforms their data. And it is true that to write effective transformations, especially incremental transformations, you always have to use the proprietary features of the particular database that you're working on.
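A rough illustration of the kind of per-destination override George mentions: the same logical column types map to different type names in each warehouse. The mappings below are representative examples, not a complete or authoritative list:

    # Logical types resolve to destination-specific type names when emitting DDL.
    TYPE_NAMES = {
        "redshift":  {"string": "VARCHAR(65535)", "integer": "BIGINT", "timestamp": "TIMESTAMP"},
        "snowflake": {"string": "VARCHAR",        "integer": "NUMBER", "timestamp": "TIMESTAMP_NTZ"},
        "bigquery":  {"string": "STRING",         "integer": "INT64",  "timestamp": "TIMESTAMP"},
    }

    def column_ddl(destination, name, logical_type):
        return f"{name} {TYPE_NAMES[destination][logical_type]}"

    print(column_ddl("redshift", "created_at", "timestamp"))
    print(column_ddl("bigquery", "created_at", "timestamp"))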
Tobias Macey
0:29:46
On the incremental piece, I'm interested in how you address that for some of the different source systems, because for the databases where you're doing change data capture, it's fairly obvious that you can take that approach for data loading. But for some of the more API-oriented systems, I'm wondering if there's a high degree of variability in being able to pull in just the objects that have changed since a certain last sync time, or if there are a number of systems that will just give you absolutely everything every time, and then you have to do that on your side.
George Fraser
0:30:20
The complexity of those change feeds, I know I mentioned this earlier, but it is staggering. But yes, on the API side, we're also doing change data capture of apps. It is different for every app, but just about every API we work with provides some kind of change feed mechanism. Now, it is complicated. You often end up in a situation where the API will give you a change feed that's incremental, but then other endpoints are not incremental, so you have to do this thing where you read the change feed, you look at the individual events in the change feed, and then you go look up the related information from the other entity. So you end up dealing with a bunch of extra complexity because of that. But as with all things at Fivetran, we have this advantage that we have many customers with each data source. So we can put in that disproportionate effort that you would never do if you were building it just for yourself, to make the change capture mechanism work properly, because we just have to do it once and then everyone who uses that data source can benefit from it.
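A stripped-down sketch of that incremental pattern: page through a change feed using an updated-since cursor, and for thin events look up the full record from another endpoint. The client, endpoints, and loader here are all hypothetical:

    # Incremental sync driven by a change feed plus related-entity lookups.
    def sync_changes(client, cursor):
        """client is assumed to expose get_change_feed(since) and get_contact(id)."""
        newest = cursor
        for event in client.get_change_feed(since=cursor):
            record = event.get("record") or client.get_contact(event["id"])  # enrich thin events
            upsert_into_warehouse("contact", record)                         # placeholder loader
            newest = max(newest, event["updated_at"])
        return newest   # persist this and pass it back in on the next run

    def upsert_into_warehouse(table, record):
        print(f"upsert into {table}: {record}")

    class FakeClient:
        def get_change_feed(self, since):
            return [{"id": 7, "updated_at": "2019-06-01T12:00:00Z", "record": None}]
        def get_contact(self, contact_id):
            return {"id": contact_id, "email": "c@example.com"}

    next_cursor = sync_changes(FakeClient(), cursor="2019-06-01T00:00:00Z")
    print("next cursor:", next_cursor)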
Tobias Macey
0:31:23
For people who are getting onboarded onto the Fivetran system, I'm curious what the overall workflow looks like as far as the initial setup, and then what their workflow looks like as they're adding new sources or just interacting with their Fivetran account for being able to keep track of the overall health of their system, or if it's largely just fire and forget, and they're only interacting with the data warehouse at the other side.
George Fraser
0:31:47
It's pretty simple. The joke at Fivetran is that our demo takes about 15 seconds. Because we're so committed to automation, and we're so committed to this idea that Fivetran's fundamental job is to replicate everything into your data warehouse, and then you can do whatever you want with it, it means that there's very little UI. The process of setting up a new data source is basically connect source, which for many sources is as simple as just going through an OAuth redirect, and you just click, you know, yes, Fivetran is allowed to access my data, and that's it. And connect destination, which, now that we're actually integrated with Snowflake and BigQuery, you can just push a button in Snowflake or in BigQuery and create a Fivetran account that's pre-connected to your data warehouse. So the setup process is really simple. After setup, there's a bunch of UI around monitoring what's happening. We like to say that Fivetran is a glass box. It was originally a black box, and now it's a glass box: you can see exactly what it's doing. You can't change it, but you can see exactly what we're doing at all times. And, you know, part of that is in the UI, and part of that is in emails you get when things go wrong, or when the sync finishes for the first time, that kind of thing.
Tobias Macey
0:33:00
As part of that visibility, I also noticed that you will ship the transaction logs to the end user's log aggregation system. And I thought that was an interesting approach, as far as being able to give them a way to access all of that information in one place without having to go to your platform just for that one-off case of trying to see what the transaction logs are and gain that extra piece of visibility. So I'm wondering what types of feedback you've gotten from users as far as the overall visibility into your systems and the ways that they're able to integrate it into their monitoring platforms?
George Fraser
0:33:34
Yeah, so the logs we're talking about are the logs of every action Fivetran took, like, Fivetran made this API call against Salesforce, Fivetran ran this LogMiner query against Oracle. So we record all this metadata about everything we're doing, and then you can see that in the UI, but you can also ship it to your own logging system like CloudWatch or Stackdriver, because a lot of companies, in the same way that they have a centralized data warehouse, have a centralized logging system. It's mostly used by larger companies; those are the ones who invest the effort in setting up those centralized logging systems. And it's actually the system we built first, before we built it into our own UI. Later, we found it's also important just to have it in our own UI, just as a quick way to view what's going on. And, yeah, I think people have appreciated that we're happy to support the systems they already have, rather than try to build our own thing and force you to use that.
Tobias Macey
0:34:34
I imagine that that also plays into efforts within these organizations for being able to track data lineage and provenance for understanding the overall lifecycle of their data as it spans across different systems.
George Fraser
0:34:47
You know, that's not so much of a logging problem, that's more of a metadata problem inside the data warehouse. When you're trying to track lineage, to say, like, this row in my data warehouse came from this transformation, which came from these three tables, and these tables came from Salesforce, and it was connected by this user, and it synced at this time, etc., that lineage problem is really more of a metadata problem. And that's kind of a greenfield in our area right now. There's a couple of different companies that are trying to solve that problem, and we're doing some interesting work on that in conjunction with our transformations. I think it's a very important problem. There's still a lot of work to be done there.
Tobias Macey
0:35:28
So on the sales side of things too, I know you said that your demo is about 15 seconds, as far as, yes, you just do this and this, and then your data is in your data warehouse. But I'm wondering what you have found to be some of the common questions or common issues that people have that bring them to you as far as evaluating your platform for their use cases, and just some of the overall user experience design that you've put into the platform as well, to help ease that onboarding process.
George Fraser
0:35:58
Yeah, so a lot of the discussions in the sales process really revolve around that ELT philosophy of: Fivetran is going to take care of replicating all of your data, and then you're going to curate it non-destructively using SQL. For some people that just seems like the obvious way to do it, but for others, this is a very shocking proposition, this idea that your data warehouse is going to have this comparatively uncurated schema that Fivetran is delivering data into, and then you're basically going to make a second copy of everything. For a lot of people who've been doing this for a long time, that's a very surprising approach. And so a lot of the discussion in sales revolves around the trade-offs of that, and why we think that's the right answer for the data warehouses that exist today, which are just so much faster and so much cheaper that it makes sense to adopt that more human-friendly workflow than maybe it would have in the 90s.
Tobias Macey
0:36:52
And what are the cases where Fivetran is the wrong choice for being able to replicate data or integrate it into a data warehouse?
George Fraser
0:37:00
Well, if you already have a working system, you should keep using it. We don't advise people to change things just for the sake of change. If you've set up, you know, a bunch of Python scripts that are syncing all your data sources, and it's working, keep using it. What usually happens that causes people to throw out a system is schema changes, death by a thousand schema changes. They find that the data sources upstream are changing, their scripts that are syncing their data are constantly breaking, and it's this huge effort to keep them alive. And so that's the situation where prospects will abandon an existing system and adopt Fivetran. But what I'll tell people is, you know, if your schema is not changing, if you're not having to go fix these pipelines every week, don't change it, just keep using it.
Tobias Macey
0:37:49
And as far as the overall challenges or complexities of the problem space that you're working with, I'm wondering what you have found to be some of the most difficult to overcome, or some of the ones that are most noteworthy and that you'd like to call out for anybody else who is either working in this space or considering building their own pipeline from scratch.
George Fraser
0:38:11
Yeah, you know, I think that when we got our first customer in 2015, syncing Salesforce to Redshift, and two weeks later we got our second customer, syncing Salesforce and HubSpot and Stripe into Redshift, I sort of imagined that this sync problem was, like, going to be solved in a year, and then we would go on and build a bunch of other related tools. And the sync problem is much harder than it looks at first. Getting all the little details right, so that it just works, is an astonishingly difficult problem. It is a parallelizable problem: you can have lots of developers working on different data sources, figuring out all those little details, and we have accumulated general lessons that we've incorporated into our core code, so we've gotten better at doing this over the years. And it really works when you have multiple customers who have each data source, so it works a lot better as a product company than as someone building an in-house data pipeline. But the level of complexity associated with just doing replication correctly was kind of astonishing for me, and I think it is astonishing for a lot of people who try to solve this problem. You know, you look at the API docs of a data source, and you figure, oh, I think I know how I'm going to sync this. And then you go into production with 10 customers, and suddenly you find 10 different corner cases that you never thought of that are going to make it harder than you expected to sync the data. So the level of difficulty of just that problem is kind of astonishing. But the value of solving just that problem is also kind of astonishing.
Tobias Macey
0:39:45
On both the technical and business sides, I'm also interested in understanding what you have found to be the most interesting or unexpected or useful lessons that you've learned in the overall process of building and growing Fivetran.
George Fraser
0:39:59
Well, I've talked about some of the technical lessons, in terms of, you know, just solving that problem really well is both really hard and really valuable. In terms of the business lessons we've learned, you know, growing the company is a co-equal problem to growing the technology. I've been really pleased with how we've made a place where people seem to genuinely like to work, where a lot of people have been able to develop their careers in different ways. Different people have different career goals, and you need to realize that, as someone leading a company, not everyone at the company is like yourself; they have different goals that they want to accomplish. So that problem of growing the company is just as important, and just as complex, as solving the technical problems and growing the product and growing the sales side and helping people find out that you have this great product that they should probably be using. So I think that has been a real lesson for me over the last seven years that we've been doing this.
Tobias Macey
0:41:11
And now, for the future of Fivetran, what do you have planned, both on the business roadmap as well as the feature sets that you're looking to integrate into Fivetran, and just some of the overall goals that you have for the business as you look forward?
George Fraser
0:41:12
Sure. So some of the most important stuff we're doing right now is on the sales and marketing side. We have done all of this work to solve this replication problem, which is very fundamental and very reusable, and I like to say no one else should have to deal with all of these APIs, since we have done it. You should not need to write a bunch of Python scripts to sync your data, or configure Informatica, or anything like that. We've done it once so that you don't have to, and I guarantee you, it will cost you less to buy Fivetran than to have your own team basically building an in-house data pipeline. So we're doing a lot of work on the sales and marketing side, just to get the word out that Fivetran is out there, and that it might be something that's really useful to you. On the product side, we are doing a lot of work now in helping people manage those transformations in the data warehouse. So we have the first version of our transformations tool in our product, and there's going to be a lot more sophistication added to that over the next year. We really view that as the next frontier for Fivetran: helping people manage the data after we've replicated it.
Tobias Macey
0:42:17
Are there any other aspects of the Fivetran company and technical stack, or the overall problem space of data synchronization, that we didn't touch on that you'd like to cover before we close out the show?
George Fraser
0:42:28
I don't think so. I think the thing that people tend to not realize, because they tend to just not talk about it as much, is that the real difficulty in this space is all of that incidental complexity of all the data sources. You know, Kafka is not going to solve this problem for you. Spark is not going to solve this problem for you. There is no fancy technical solution. Most of the difficulty of the data centralization problem is just in understanding and working around all of the incidental complexity of all these data sources.
Tobias Macey
0:42:58
For anybody who wants to get in touch with you or follow along with the work that you and Fivetran are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
George Fraser
0:43:15
Yeah, I think that the biggest gap right now is in the tools that are available to analysts who are trying to curate the data after it arrives. So writing all the SQL that curates the data into a format that's ready for the business users to attack with BI tools is a huge amount of work, and it remains a huge amount of work. And if you look at the workflow of the typical analyst, they're writing a ton of SQL, and the tools they're using, it's a very analogous problem to a developer writing code using Java or C#, but the tools that analysts have to work with look like the tools developers had in, like, the 80s. I mean, they don't even really have autocomplete. So I think that is a really underinvested-in problem, just the tooling for analysts to make them more productive, in the exact same way as we've been building tooling for developers over the last 30 years. A lot of that needs to happen for analysts too, and I think it hasn't happened yet.
Tobias Macey
0:44:13
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Fivetran and some of the insights that you've gained in the process. It's definitely an interesting platform and an interesting problem space, and I can see that you're providing a lot of value. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
George Fraser
0:44:31
Thanks for having me on.

Simplifying Data Integration Through Eventual Connectivity - Episode 91

Summary

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
  • What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
  • In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
    • How do different implementations of graph databases impact their viability for this use case?
  • Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
  • How much up-front modeling is necessary to make this a viable approach to data integration?
  • How do the volume and format of the source data impact the technology and architecture decisions that you would make?
  • What are the limitations or edge cases that you have found when using this pattern?
  • In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
  • What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities within the startups that are changing the world, AngelList is the place to go. Go to dataengineeringpodcast.com/angel to sign up today. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL. And just as a full disclosure, Tim is the CEO of CluedIn, who is a sponsor of the podcast. So Tim, can you just start by introducing yourself?
Tim Ward
0:02:09
Yeah, sure. My name is Tim Ward. As Tobias said, I'm the CEO of a data platform called CluedIn. I'm based out of Copenhagen, Denmark. I have with me my wife, my little boy Finn, and a little dog that looks like an Ewok, called Seymour.
Tobias Macey
0:02:29
And do you remember how you first got involved in the area of data management?
Tim Ward
0:02:32
Yeah, so I mean, I guess I'm a classically trained software engineer. I've been working in the software space for around 13, 14 years now, predominantly in the web space, but mostly for enterprise businesses. And around, I don't know, maybe six or seven years ago, I was given a project in the space of what's called multivariate testing. It's the idea that if you've got a website, maybe the homepage of a website, and we make some changes, or different variations, which variation works better for the amount of traffic that you're wanting to attract, or maybe the amount of purchases that the company makes on the website? So using this, that was my first foray into, okay, this involves me having to capture data on analytics. That then took me down this rabbit hole of realizing, got it, I have to not only get the analytics from the website, but I need to correlate this against, you know, back office systems, CRM systems, ERP systems, and PIM systems. And I kind of realized, oh my God, this becomes quite tricky with the integration piece. And once I went down that rabbit hole, I realized, oh, for me to actually start doing something with this data, I need to clean it, I need to normalize it. And, you know, basically I got to this point where I realized that data is kind of a hard thing to work with; it's not something you can pick up and just start getting value out of straight away. So that's kind of what led me, around four and a half, five years ago, to say, you know what, I'm going to get into this data space. And ever since then, I've just enjoyed immensely being able to help large enterprises become a lot more data driven.
Tobias Macey
0:04:31
And so to frame the discussion a bit, I'm wondering if you can just start by discussing some of the challenges and shortcomings that you have seen in the existing practices of ETL.
Tim Ward
0:04:42
Yes, sure. I mean, I guess I want to start by not trying to be that grumpy old man that's yelling at all technologies. One thing I've learned in my career is that it's very rare that a particular technology or approach is right or wrong; it's just right for the right use case. And you're seeing a lot more patterns in integration emerge. Of course, we've got the ETL that's been around forever, you've got this ELT approach which has been emerging over the last few years, and then you've seen streaming platforms also take up the idea of joining streams of data instead of something that is done up front. And, to be honest, I've always wondered with ETL, how on earth are people achieving this for an entire company? ETL for me has always been something where, if you've got two, three, four tools to integrate, it's a fantastic kind of approach, right? But now we're definitely dealing with a lot more data sources, and the demand for having free-flowing data available is becoming much more apparent. It got to the point where I thought, am I the stupid one? If I have to use ETL to integrate data from multiple sources, as soon as we go over a certain number of data sources, the problem just becomes exponentially harder. And I think the thing I found interesting as well with this ETL approach is that typically the data was processed through these classic designers, workflow DAGs, you know, directed acyclic graphs, and the output of this process was typically, oh, I'm going to store this in a relational database. And therefore, you know, I can understand why ETL existed. If you know what you're going to do with your data after this ETL process, and classically you would go into something like a data warehouse, I can see why that existed. But I think there are just different types of demands in the market today: there's much more need for flexibility and access to data, and not necessarily data that's been modeled as rigidly as you get in the kind of classical data warehouses. And I kind of thought, well, the relational database is not the only database available to us as engineers, and one of the ones that I've been focusing on for the last few years is the graph database. When you think about it, most problems that we're trying to solve in the modeling world today are actually a network, they are a graph; they're not necessarily relational or a kind of flat document store. So I thought, you know, this seems more like the right store to be able to model the data.
And I think the second thing was that, just from being hands on, I found that with this ETL process, when I was trying to solve problems and integrate data up front, I had to know what all the business rules were that dictated how the systems integrate, and what dictated clean data. You're probably used, Tobias, to these ETL designers, where you get these built-in functionalities to do things like trim whitespace and tokenize the text and things like that, and you think, yes, but I need to know up front what is considered a bad ID or a bad record. You're probably also used to seeing that we've got these IDs, and sometimes it's a beautiful looking ID and sometimes it's negative one, or NA, or a placeholder, or a hyphen, and you think, I've got to, up front in the ETL world, define all of those possibilities before I run my ETL job. I just found this quite rigid in its approach. And I think the kind of game changer for me was that when I was using ETL and these classic designers to integrate more than five systems, I realized how long it took up front, and that I needed to go around the different parts of the business and have them explain: okay, so how does the Salesforce lead table connect to the Marketo lead table? How does it do that? And then, time after time, after weeks of investigation, I would realize, oh, I have to jump to, I don't know, the Exchange server or the Active Directory to get the information that I need to join those two systems. It just resulted in a spaghetti of point-to-point integrations. And I think that's one of the key things that ETL suffers from: it puts us in an architectural design thinking pattern of, oh, how am I going to map systems point to point? And I can tell you, after working in this industry for five years so far, that systems don't naturally blend together point to point.
Tobias Macey
0:10:04
Yeah, your point about the fact that you need to understand all the possible representations of a null value means that in order for a pipeline to be sufficiently robust, you have to have a fair amount of quality testing built in to make sure that any values that are coming through the system map to some of the existing values that you're expecting, and then be able to raise an alert when you see something that's outside of those bounds so that you can then go ahead and fix it. And then being able to have some sort of a dead letter queue or bad data queue for holding those records until you get to a point where you can reprocess them, and then being able to go through and back-populate the data. So it definitely is something that requires a lot of engineering effort in order to be able to have something that is functional for all the potential values. And also there's the aspect of schema evolution, and being able to figure out how to propagate that through your system and have your logic flexible enough to handle different schema values for cases where you have data flowing through that is at the transition boundary between the two schemas. So certainly a complicated issue. And so you recently released a white paper discussing some of your ideas on this concept of eventual connectivity. I'm wondering if you can describe your thoughts on that and touch on how it addresses some of the issues that you've seen with the more traditional ETL pattern.
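
To make that kind of guard concrete, here is a minimal Python sketch (not from the episode; the null markers, field name, and in-memory queue are all assumed) of checking incoming IDs against known null-like representations and parking anything unexpected on a bad-data queue for later reprocessing:

```python
from collections import deque

KNOWN_NULLS = {"-1", "na", "n/a", "", "-", "placeholder"}  # assumed markers

dead_letter_queue = deque()  # stand-in for a real queue such as Kafka or SQS

def normalize_id(record: dict, field: str = "company_id"):
    """Return a cleaned ID, None for a known null marker, and park the
    record if the value falls outside anything we expect."""
    raw = str(record.get(field, "")).strip().lower()
    if raw in KNOWN_NULLS:
        return None
    if not raw.isalnum():
        dead_letter_queue.append(record)  # hold for later reprocessing
        return None
    return raw

print(normalize_id({"company_id": " -1 "}))         # None: known null marker
print(normalize_id({"company_id": "ACME123"}))      # "acme123"
print(normalize_id({"company_id": "??unknown??"}))  # None, record is parked
print(len(dead_letter_queue))                       # 1
```
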
Tim Ward
0:11:38
Yeah, sure. I mean, I think there are a couple of fundamental things to understand about the concept behind this pattern we've named eventual connectivity. First of all, it's a pattern that essentially embraces the idea that we can throw data into a store, and as we continue to throw in more data, the records will find out for themselves how to be merged. It's the idea of being able to place records into this kind of central place, this central repository, with little hooks, little hooks that are flags indicating, hey, I'm a record, and here are my unique references. The idea being, obviously, that as we bring in more systems, those other records will say, hey, I actually have the same ID. Now, that might not happen up front; it might be after you've integrated system one, two, three, four, five, six that systems two and four are able to say, hey, I now have the missing pieces to be able to merge our records. So in an eventual connectivity world, what this really brings in advantages is that, first of all, if I'm trying to integrate systems, I only need to take one system at a time. I found it rare in the enterprise that I could find someone who understood the domain knowledge behind their Salesforce account, and their Oracle account, and their Marketo account; I would often run into someone who completely understood the business domain behind the Salesforce account. And the reason I'm using that as an example is because Salesforce is an example of a system where you can do anything in it: you can add objects that are, you know, animals or dinosaurs, not just the ones that are out of the box. I don't know who's selling to dinosaurs. But essentially, what this allows me to do is, when I walk into an integration job and that business says, hey, we have three systems, I say, got it. And if they say, oh, sorry, that was actually 300 systems, I go, got it, it makes no difference. To me, it's only a time-based thing; the complexity doesn't get more complex because of this type of pattern that we're taking. And I'll explain the pattern. Essentially, you can conceptualize it as we go through a system a record at a time, or an object at a time. Let's take something like leads or contacts. The pattern basically asks us to highlight what the unique references to that object are. So in the case of a person, it might be something like a passport number; it might be, you know, a local personal identifier. In Denmark we have what's called the CPR number, which is a unique reference to me; no one else in Denmark has the same number. Then you get to things like emails, and what you discover pretty quickly in the enterprise data world is that email is in no way a unique identifier of an object, right? We can have group emails that refer to multiple different people, and not all systems will specify whether this is a group email or an email referring to an individual. So the pattern asks us, or dictates us, to mark those references as aliases: something that could allude to a unique identifier of an object. And then we get to the referential pieces. So imagine that we have a contact that's associated with a company; you could probably imagine that as a column in the contact table that's called company ID.
And the key thing with the eventual connectivity pattern is that, although I want to highlight that as a unique reference to another object, I don't want to tell the integration pattern where that object exists. I don't want to tell it that it's in the Salesforce organization table, because, to be honest, if that's a unique reference, that unique reference may exist in other systems. And so what this means is that I can take an individual system at a time and not have to have this standard point-to-point type of relationship between data. And I think if I was to highlight the three main wins that you get out of this, the first is that it's quite powerful to walk into a large business and say, hey, how many systems do you have? Well, we have 1000. And I think, good, when can we start? Now, if I was in the ETL approach, I would be thinking, oh God, can we actually honestly do this?
0:16:40
As you probably know yourself, Tobias, often we go into projects with big smiling faces, and then when you see the data, you realize, oh, this is going to be a difficult project. So there's that advantage of being able to walk in and say, I don't care how many systems you have, it makes not a lot of complexity difference to me. I think the other piece is that the eventual connectivity pattern addresses this idea that you don't need to know all the business rules up front of how systems connect to each other, and what's considered bad data versus good data. Rather, you know, we let things happen, and we have a much more reactive approach to be able to rectify them. And I think this is more cognizant, or more representative, of the world we live in today: companies are wanting more real-time data to their consumers, or to the consumption technologies where we get the value, things like business intelligence, etc. And they're not willing to put up with these kind of rigid approaches of, oh, the ETL process has broken down, I need to go back to our designer, I need to update that and run it through and make sure that we guarantee the data is in perfect order before we actually do the merging. I think the final thing that has become obvious, time after time, where I've seen companies use this pattern, is that this eventual connectivity pattern will discover joins where it's really hard for you and me to sit down and figure out where those joins are. And I think it comes back to this core idea that systems don't connect well point to point: there's not always a nice, ubiquitous ID that we can use to just join two systems together. Often we have to jump between different data sources to be able to wrangle this into a unified set of data. Now, at the same time, I can't deny that there's quite a lot of work going on in the field of ETL. You've got platforms like NiFi and Airflow, and you know what, those are still very valid. They're very good at moving data around; they're fantastic at breaking down a workflow into these discrete components that can, in some cases, run independently. I think that the eventual connectivity pattern, for us, time after time, has allowed us to blend systems together without this overhead of complexity. And Tobias, there's not a big enough whiteboard in the world when it comes to integrating, you know, 50 systems. You just have to put yourself in that situation and realize, oh wow, the old ways of doing it are just not scaling.
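
As a rough illustration of the annotation step Tim describes (tagging each record with entity codes, aliases, and edge references that deliberately do not name the system the target lives in), here is a hypothetical Python sketch. The field names and structure are assumptions for the example, not CluedIn's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedRecord:
    source: str
    payload: dict
    entity_codes: list = field(default_factory=list)  # hard IDs, e.g. CPR number
    aliases: list = field(default_factory=list)       # weak IDs, e.g. email, phone
    edges: list = field(default_factory=list)         # references to other objects

def annotate_salesforce_contact(row: dict) -> AnnotatedRecord:
    rec = AnnotatedRecord(source="salesforce", payload=row)
    if row.get("cpr_number"):
        rec.entity_codes.append(("person/cpr", row["cpr_number"]))
    if row.get("email"):
        rec.aliases.append(("email", row["email"].lower()))
    if row.get("company_id"):
        # Record the reference, but not which system the company record lives in.
        rec.edges.append(("works_at", ("organization/id", row["company_id"])))
    return rec

print(annotate_salesforce_contact(
    {"cpr_number": "010190-1234", "email": "jane@example.com", "company_id": "123"}
))
```

Each source system would get its own small annotator like this, which is what lets the integration proceed one system at a time.
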
Tobias Macey
0:19:31
And as you're talking through this idea of eventual connectivity, I'm wondering how it ends up being materially different from a data lake, where you're able to just do the more ELT pattern of shipping all of the data into this repository without having to worry about modeling it up front and understanding what all the mappings are, and then doing some exploratory analysis after the fact to be able to create all of these connection points between the different data sources and do whatever cleaning happens after the fact.
Tim Ward
0:20:03
Yeah, I mean, one thing I've gathered in my career as well is that something like an overall data pipeline for a business is going to be made up of so many different components. And in our world, in the eventual connectivity world, the lake still makes complete sense to have. I see the lake as a place to dump data, where I can read it in a ubiquitous language; in most cases it's SQL that's exposed, and I don't know a single person in our industry that doesn't know SQL to some degree. So it's fantastic to have that lake there. Where I see the problem often evolving is that the lake is obviously a place where we would typically store raw data. It's where we abstract away the complexity of, oh, if I need data from a SharePoint site, I have to learn the SharePoint API. No, the lake is there to basically say, that's already been done; I'm going to give you SQL, that's the way you're going to get this data. What I find, when I look at the companies that we work with, is that, yes, but there's a lot that needs to be done from the lake to where we can actually get the value. I think something like machine learning is a good example. Time after time we hear, and it's true, that machine learning doesn't really work that well if you're not working with good quality, well integrated data that is complete, that isn't, you know, full of nulls and empty columns and things like that. And what I found is that we went through this period in our industry where we said, okay, well, the lake is going to give the data science teams and the different teams direct access to the raw data. And what we found is that every time they tried to use that data, they went through the common practices of: now I need to blend it, now I need to catalog it, now I need to normalize it and clean it. And you could see that the eventual connectivity pattern is there to say, no, no, this is something that sits in between, after the lake, that matures the data to the point where it's already blended. And that's one of the biggest challenges I see there: if I get a couple of different files out of the lake, and then I go to investigate how these join together, I still have this experience of, oh, this doesn't easily blend together. So then I go on this exploratory, discovery phase of what other datasets I have to use to string these two systems together. And we would just like to eliminate that.
Tobias Macey
0:22:46
So to make this a bit more concrete for people who are listening and wondering how they can put this pattern into effect in their own systems, can you talk through an example system architecture and data flow for a use case that you have done, or at least experimented with, to be able to put this into effect, how the overall architecture plays together to make this a viable approach, and how those different connection points between the data systems end up manifesting?
Tim Ward
0:23:17
Yeah, definitely. So maybe it's good to use an example. Imagine you have three or four data sources that you need to blend: you need to ingest them, you then usually need to merge the records into kind of a flat, unified data set, and then you need to push this somewhere. That might be a data warehouse, something like BigQuery or Redshift, etc. And the fact is that, in today's world, that data also needs to be available for the data science team, and now it needs to be available for things like exploratory business intelligence. So when you're building your integrations, I think architecturally, from a modeling perspective, the three things that you need to think about are what we call entity codes, aliases, and edges, and those three pieces together are what we need to be able to map this properly into a graph store. Simply put, an entity code is kind of an absolute unique reference to an object, as I alluded to before, something like a passport number. And that's a unique reference to an object; but an ID by itself, just the number, doesn't necessarily mean that it's unique across all the systems that you have at your workplace.
0:24:42
So the other is aliases. Aliases are more like an email, a phone number, a nickname; they're all alluding to some type of overlap between these records, but they're not something that we can honestly go ahead and merge records off with a hundred percent confidence. Now, of course, having that, you then need to investigate things like inference engines to build up confidence: how confident can I be that a person's nickname is unique in the reference of the data that we've plugged in from these three or four data sources I'm talking about? And then finally, the edges. They're there to be able to build referential integrity. What I find architecturally is that when we're trying to solve data problems for companies, the majority of the time their model represents much more of a network than the classic relational store, or columnar database, or document store. And so when we look at the technology that's needed to support this system architecture, one of the key technologies at the heart of this is a graph database. And to be honest, it doesn't really matter which graph database you use, but what we found important is that it needs to be a native graph store. There are triple stores out there, there are multi-model databases like Cosmos DB and SAP HANA, but what we found is that you really need a native graph to be able to do this properly. So the way that you can conceptualize the pattern is that every record that we pull in from a system, or that you import, will go into this network or graph as a node. And every entity code for that record, i.e. the unique ID, or multiple unique IDs of that record, will also go in as a node connected to that record. Now, every alias will go in as a property on that original node, because we probably want to do some processing later to figure out if we can turn that alias into one of these entity codes, these unique references. And here's the interesting part, this is the part where the eventual connectivity pattern kicks in: all the edges, say if I was referencing a person to a company, that that person works at a company, those edges are placed into the graph, and a node is created for the other end, but it's marked as hidden. We call those shadow nodes. So you can imagine if we brought in a record on Barack Obama, and it had Barack's phone number. Now, that's not a unique reference, but what we would do is create a node in the graph referring to that phone number, link it to Obama, but mark the phone number node as hidden; as I said before, we call those shadow nodes. And essentially, you can see that as one of these hooks: ah, if I ever get other records that come in later that also have an alias or an entity code that overlaps, that's where I might need to start doing my merging work. And what we're hoping for, and this is what we see time after time as well, is that as we import system one's data, it'll start to come in and you'll see a lot of nodes that are the shadow nodes, i.e. I have nothing to hook onto on the other side of this ID. And the analogy that we use for this shadow node is that records come in, and by default they're a clue. A clue is in no way factual; in no way do we have any other records correlating to the same values. And our goal, in this eventual connectivity pattern, is to turn clues into facts.
And what makes facts is records that have the same entity code existing across different systems. So the architectural key to this pattern is that a graph store needs to be there to model the data. And here's one of the key reasons: if the landing zone of this integrated data was a relational database, I would need to have an upfront schema, I would need to specify how these objects connect to each other. What I've always found in the past is that when I need to do this, it becomes quite rigid. Now, I'm a strong believer that every database needs a schema at some point, or you can't scale these things. But what's nice about the graph is that one of the things the graph has always done really well is flexible data modeling: there is no necessarily more important object within the graph structure, they're all equal in their complexity but also in their importance. And really, pick and choose the graph database that you want, but it's one of the keys to this architectural path.
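
Here is a rough, in-memory Python sketch of the loading behaviour just described: every record becomes a node, its entity codes become linked nodes, aliases would stay as properties, and an edge whose target has not been seen yet points at a hidden shadow node. It is an illustration of the idea only, not a production graph store:

```python
class Graph:
    def __init__(self):
        self.nodes = {}  # node_id -> {"hidden": bool, "props": dict}
        self.edges = []  # (from_id, label, to_id)

    def upsert_node(self, node_id, hidden=False, **props):
        node = self.nodes.setdefault(node_id, {"hidden": hidden, "props": {}})
        node["props"].update(props)
        if not hidden:
            node["hidden"] = False  # a real record "fills in" a shadow node
        return node

    def add_edge(self, from_id, label, to_id):
        self.edges.append((from_id, label, to_id))

def ingest(graph, source, key, payload, entity_codes, edges):
    """entity_codes are (type, value) pairs; edges are (label, (type, value))
    pairs that point at other objects without naming their source system."""
    record_id = f"record:{source}:{key}"
    graph.upsert_node(record_id, **payload)
    for code in entity_codes:
        code_id = f"code:{code}"
        graph.upsert_node(code_id)  # shared across systems
        graph.add_edge(record_id, "identified_by", code_id)
    for label, target in edges:
        target_id = f"code:{target}"
        if target_id not in graph.nodes:
            graph.upsert_node(target_id, hidden=True)  # shadow node: a "clue"
        graph.add_edge(record_id, label, target_id)

g = Graph()
# System 1: a contact that references company 123, which has not been seen yet.
ingest(g, "salesforce", "c1", {"name": "Jane"},
       [("person/cpr", "010190-1234")],
       [("works_at", ("organization/id", "123"))])
# System 2: an ERP company record arrives later and resolves the shadow node.
ingest(g, "erp", "org9", {"name": "Acme"}, [("organization/id", "123")], [])
print([(nid, n["hidden"]) for nid, n in g.nodes.items() if nid.startswith("code:")])
```

Running it shows the organization code from the second system "filling in" the shadow node left behind by the first, which is the clue-to-fact transition Tim describes.
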
Tobias Macey
0:30:07
One of the things that you talked about in there is the fact that there's this flexible data model. And so I'm wondering what type of upfront modeling is necessary in order to make this approach viable? I know you talked about the idea of the entity codes and the aliases, but for somebody who is taking a source data system and trying to load it into this graph database in order to take advantage of this eventual connectivity pattern, what is the actual process of loading that information in and assigning the appropriate attributes to the different records and the different attributes in the record? And then also, I'm wondering if there are any limitations in terms of what the source format looks like, as far as the serialization format or the types of data for which this approach is viable.
Tim Ward
0:31:03
Sure, good question. So I think the first thing is to identify that with the eventual connectivity pattern and modeling it in the graph, the key is that there will always be extra modeling that you do after this step. And the reason why is because, if you think about the data structures that we have as engineers, the network or the graph is the highest fidelity data structure we have. It's a higher, more detailed structure than a tree, it's more structured than a hierarchy or a relational store, and it definitely has more structure or fidelity than something like a document. With this in mind, we use eventual connectivity to solve this piece of integrating data from different systems and modeling it, but we know we will always do better, purpose-fit modeling later. So it's worth highlighting that the value of the eventual connectivity pattern is that it makes the integration of data easier, but this will definitely not be the last modeling that you do. And that allows flexible modeling, because you always know, hey, if I'm trying to build a data warehouse based off the data that we've modeled in the graph, you're always going to run extra processes after it to model it, probably into a relational store for a data warehouse, or a columnar store; you're going to model it purpose-fit to solve that problem. However, if what you're trying to achieve with your data is flexible access to data to be able to feed other systems, you want the highest fidelity and you want the highest flexibility in modeling. But the key is that if you were to drive your data warehouse directly off this graph, it would do a terrible job. That's not what the graph was purpose built for. The graph has always been good at flexible data modeling, and it's always been good at joining records very fast, just as fast as doing an index lookup; that's how these native graph stores have been designed. And so it's important to highlight that there's really not a lot of upfront modeling. Of course, we shouldn't do silly things, but I'll give you an example. If I was modeling a skill, a person, and a company, it's completely fine to have a graph where the skill points to the person and the person points to the organization. And it's also okay to have the person point to the skill and the skill point to the organization. That's not as important. What's important at this stage is that the eventual connectivity pattern allows us to integrate data more easily. Now, when I get to the point where I want to do something with that data, I might find that, yes, I actually do need an organization table which has a foreign key to person, and then person has a foreign key to skill. And that's because that's typically what a data warehouse is built to do: it's to model the data perfectly, so if I have a billion rows of data, this report still runs fast. But we lose that kind of
0:34:34
flexibility in the data modeling. Now, as for formats and things like that, what I found is that, to some degree, with the formatting and the source data, you could probably imagine the data is not created equally, right? Many different systems will allow you to do absolutely anything you want. And where the kind of ETL approach dictates that you capture these exceptions up front (if I've got certain-looking data coming in, how does it connect to the other systems?), what eventual connectivity does is it just catches them later in the process. And my thought on this is that, to be honest, you will never know all these business rules up front, and therefore let's embrace an integration pattern that says: hey, if the schema in the source or the format of the data changes (you kind of alluded to this before as well, Tobias), okay, got it, I want to be alerted that there is an issue with deserializing this data, I want to start queuing up the data in a message queue or maybe a stream, and I want to be able to fix that schema in the platform and have it say, got it, now that that's fixed, I'll continue on with serializing the things that will now serialize. These kinds of things will happen all the time. I think I've referred to it before, and heard other people refer to it, as schema drift, and this will always happen in source and in target. So what I found success with is embracing patterns where failure is going to happen all the time. When we look at the ETL approach, it's much more of a case of when things go wrong, everything stops: the different workflow stages that we've put into our classical ETL designers all go red, red, red, red, I have no idea what to do, and I'm just going to fall over. What we would rather have is a pattern that says: got it, the schema has changed, I'm going to queue up what you need until the point where you've changed that schema, and when you put that in place, I'll test it on the side, and yeah, that schema I can now serialize, so I'll continue on. And what I find is that if you don't embrace this, you spend most of your time just reprocessing data through ETL.
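
A small sketch of that park-and-replay behaviour, with an assumed schema and no particular framework: records that fail to deserialize against the current schema are queued rather than failing the whole run, then replayed once the schema has been updated:

```python
import json
from collections import deque

REQUIRED_FIELDS = {"id", "email"}  # the current, assumed schema
parked = deque()                   # stand-in for a message queue or stream

def deserialize(raw: str) -> dict:
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"schema drift, missing: {missing}")
    return record

def process(raw: str):
    try:
        return deserialize(raw)
    except ValueError:      # json.JSONDecodeError is also a ValueError
        parked.append(raw)  # park it instead of stopping the pipeline
        return None

process('{"id": 1, "email": "a@example.com"}')          # fine
process('{"id": 2, "contact_email": "b@example.com"}')  # parked

# Later: an operator updates the schema and the parked records are replayed.
REQUIRED_FIELDS = {"id", "contact_email"}
replayed = [deserialize(parked.popleft()) for _ in range(len(parked))]
print(len(parked), len(replayed))  # 0 1
```
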
Tobias Macey
0:37:13
And so it seems that there actually is still a place for workflow engines, or some measure of ETL, where you're extracting the data from the source systems, but rather than loading it into your data lake or your data warehouse, you're adding it to the graph store for then being able to pull these mappings out, and then also potentially going from the graph database, where you have joined the data together, and being able to pull it out from that using some sort of query, to have the mapped data extracted and then loaded into your eventual target.
Tim Ward
0:37:49
I mean, what you've just described there is a workflow, and therefore the workflow systems still make sense. It's very logical to look at these workflows and say, oh, that happens, then that happens, then that happens; they completely still make sense. And in some cases I actually still use the ETL tools for very specific jobs. But what you can see is that if we were to use these kinds of classical workflow systems, the eventual connectivity pattern, as you described, is just one step in that overall workflow. I think what I found over time is that, no, we were using these workflow systems to be able to join data, and I would actually rather hand that off to an individual step called eventual connectivity, where it does the joining and things like that for me, and then continue on from there. Very similar to the example you gave, and as I've also been mentioning here, there will always be things you do after the graph, and that is something you could easily push into one of these workflow designers. Now, as for an example of the times when our customers still use these tools, I think one of the ones that makes complete sense is IoT data. And it's mainly because it's not typical, at least in the cases that we've seen, that there's as much hassle in blending and cleaning that data; we see that more with operational data, things like transactions, customer data and customer records, which are typically quite hard to blend. But when it comes to IoT data, if there's something wrong with the data such that it can't blend, it's often that, well, maybe it's a bad reader that we're reading off, instead of something that is actually dirty data. Now, of course, every now and then, if you've worked in that space, you'll realize that readers can lose their network connection and you can have holes in the data. But eventual connectivity would not solve that either, right? And typically, in those cases, you'll do things like impute the values from historical and future data to fill in the gaps, and it's always a little bit of a guess; that's why it's called imputing. But, to be honest, if it was my task to build a unified data set from across different sources, I would just choose this eventual connectivity pattern every single time. If it was a workflow of data processing where I know that data blends easily, then there's not a data quality issue, right? Where there is, and you need to jump across multiple different systems to merge data, I have just, time after time, found that these workflow systems reach their limit, where it just becomes too complex.
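
As a toy sketch of where that single step might sit in an ordinary workflow, here is a minimal Python pipeline; the task names and bodies are purely illustrative, and in practice the surrounding orchestration would live in something like Airflow or NiFi:

```python
def extract_salesforce():
    return [{"source": "salesforce", "id": "c1"}]

def extract_erp():
    return [{"source": "erp", "id": "org9"}]

def eventual_connectivity(*batches):
    # In practice this is the graph-loading step sketched earlier; here it just
    # concatenates the batches to show where the joining responsibility lives.
    return [record for batch in batches for record in batch]

def publish_to_warehouse(records):
    print(f"modelling {len(records)} blended records into the warehouse")

publish_to_warehouse(eventual_connectivity(extract_salesforce(), extract_erp()))
```
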
Tobias Macey
0:40:53
And for certain scales or varieties of data, I imagine that there are certain cases that come up when trying to load everything into the graph store. And so I'm wondering what you have run up against as far as limitations to this pattern, or at least alterations to the pattern to be able to handle some of these larger volume tasks.
Tim Ward
0:41:15
I think I'll start with this: the graph is notoriously hard to scale. With most of the graph databases that I've had experience with, you're essentially bound to one big graph, so there's no real idea of clustering these data stores into sub-graphs that you could query across. So scaling that is actually quite hard to start with. But as for the limitations of the pattern itself, there are many. It starts with the fact that you need to be careful. I'll give you a good example: I've seen many companies that use this pattern flag something like an email as unique, and then we realize later, no, it's not, and we have merged records that are not duplicates. This means, of course, that you need support in the platform that you're utilizing for the ability to split these records and fix them and reprocess them at a later point. But I mean, these are also things that would be very hard to pick up in ETL or ELT types of patterns. I think one of the other downsides of this approach is that up front you don't know how many records will join. Like the name alludes to, eventually you'll get joins or connectivity, and you can think of it as: this pattern will decide how many records it will join for you, based off these entity codes or unique references, or the power of your inference engine when it comes to things that are a little bit fuzzy, a fuzzy ID for someone, things like phone numbers. The great thing about this is it also means that you don't need to choose what type of join you're doing. I mean, in the relational world, you've got plenty of different types of joins: your inner joins, outer joins, left and right outer joins, things like this. In the graph, there's one join, right? And so with this pattern, it's not like you can pick the wrong join to go with; there's only one type. So it really becomes useful when, no, I'm just trying to merge these records, I don't need to hand-hold how the joins will happen. I think one of the other downsides that I've had experience with is that, let's just say you have system one and two. What you'll often find is that when you integrate these two systems, you have a lot of these shadow nodes in the graph, sometimes we call them floating edges, where, hey, I've got a reference to a company with an ID of 123, but I've never found the record on the other side with the same ID. So, in fact, I'm storing lots of extra information that I'm not actually utilizing. But the advantage is in saying, yeah, but you will integrate system four, you will integrate system five, where that data sits, and the value is that you don't need to tell the system how they join; you just need to flag these unique references. And I think the final limitation that I've found with this pattern is that you learn pretty quickly, as I alluded to before, that there are many records in your data sources where you think a column is unique, but it's not.
Tim Ward
0:44:45
it might be unique in your system,
Tim Ward
0:44:47
i.e. in Salesforce the ID is unique, but if you actually look across the other parts of the stack, you realize, no, no, there is another company in another system with a record called 123, and they have nothing to do with each other. And so these entity codes that I've been talking about are made up of multiple different parts: they're made up of the ID, 123; they're made up of a type, something like organization; and they're made up of a source of origin, you know, Salesforce account 456. And what this does is it guarantees uniqueness if you add in two Salesforce accounts, or if you add in systems that have the same ID but it came from a different source. And as I said before, a good example would be the email. I mean, even at our company, we use GitHub Enterprise to store our source code, and we have notifications that our engineers get when there are pull requests and things like this. GitHub actually identifies each employee as notifications@github.com; that's what that record sends us as its unique reference. And of course, if I marked this as a unique reference, all of those employee records, using this pattern, would merge. However, what I like about this approach is that at least I'm given the tools to rectify the bad data when I see it. And to be honest, if companies are wanting to become much more data driven, as we aim to help our customers with, I just believe it means we have to start to shift, or learn to accept, that there's more risk that could happen. But is that risk of having data more readily available to the forefront worth more than the old approaches that we were taking?
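
A minimal sketch of the composite entity code described here: origin, type, and raw ID combined so that "123" coming from two different sources never collides. The exact format is an assumption for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityCode:
    entity_type: str  # e.g. "organization"
    origin: str       # e.g. "salesforce:account-456"
    value: str        # e.g. "123"

    def __str__(self):
        return f"{self.entity_type}#{self.origin}#{self.value}"

a = EntityCode("organization", "salesforce:account-456", "123")
b = EntityCode("organization", "erp:prod", "123")
print(a == b)  # False: same raw ID, different origin, so no accidental merge
```
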
Tobias Macey
0:46:46
And for anybody who wants to dig deeper into this idea, or learn more about your thoughts on it, or some of the adjacent technologies, what are some of the resources that you recommend they look to?
Tim Ward
0:47:00
So I mean, I guess the first thing, and Tobias, you and I have talked about this before, is that the way to learn more about it is to get in contact and challenge us on this idea. When you see a technology and you're an engineer, and you go out and start using it, you have this tendency to gain a bias around it, because time after time you see it working, and then you think, why is not everybody else doing this? And actually, the answer is quite clear: it's because things like graph databases were not as ubiquitous as they are right now. You know, you can get off-the-shelf, free graph databases to use, and even 10 years ago this was just not the case; you would have to build these things yourself. So I think the first thing is, you can get in touch with me at tiw@cluedin.com if you're just interested in challenging this design pattern and really getting to the crux of: really, is this something that we can replace ETL with completely? I think the other thing is the white paper that you alluded to; that's available from our website, so you can always jump over to cluedin.com to read that white paper. It's completely open and free for everyone to read. And then we also have a couple of YouTube videos (if you just type in CluedIn, I'm sure you'll find them) where we talk in depth about utilizing the graph to be able to merge different data sets, and we really go into depth. But I mean, I always like to talk to other engineers and have them challenge me on this, so feel free to get in touch. And I guess if you're wanting to learn much more, we also have our developer training that we give here at CluedIn, where we compare this pattern to other patterns that are out there, and you can get hands-on experience with taking different data sources, trying the multiple different integration patterns that are out there, and really just seeing the one that works for you.
Tobias Macey
0:49:04
Is there anything else about the ideas of eventual connectivity, or the ETL patterns that you have seen, or the overall space of data integration, that we didn't discuss yet that you'd like to cover before we close out the show?
Tim Ward
0:49:16
I think, for me, I always like it when I have more engineering patterns and tools on my tool belt. So I think the thing for listeners to take from this is: use this as an extra little piece on your tool belt. If you find that you walk into a company that you're helping, and they say, hey listen, we're really wanting to start to do things with our data, and they say, yeah, we've got 300 systems and, to be honest, I've been given the direction to pull and wrangle this into something we can use, then really think about this eventual connectivity pattern; really investigate it as a possible option. You'll be able to see it in the white paper, but to implement the pattern yourself is really not complex. Like I said before, one of the keys is to just embrace maybe a new database family to be able to model your data. And yes, get in touch if you need any more information on it.
Tobias Macey
0:50:22
And one follow-on from that, I think, is the idea of migrating from an existing ETL workflow into this eventual connectivity space. It seems that the logical step would be to just replace your current target system with the graph database, adding in the mapping for the entity IDs and the aliases, and then you're at least partway on your way to being able to take advantage of this, and then just adding a new ETL or workflow step at the other end to pull the connected data out into what your original target systems were. Yeah, exactly.
Tim Ward
0:51:02
I mean, it's quite often we walk into a business and they've already got many years of business logic embedded into these ETL pipelines. And my idea on that is not to just throw these away; there's a lot of good stuff there. It's really just to complement them with this extra design pattern, which is probably a little bit better at the whole merging and deduplication of data.
Tobias Macey
0:51:29
All right. Well, for anybody who wants to get in touch with you, I'll add your email and whatever other contact information to the show notes, and I've also got a link to the white paper that you mentioned. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Tim Ward
0:51:49
Well, I would say, a little bit off topic, but I would actually say that I'm amazed how many companies I walk into that don't know what the quality is of the data they are working with. So I think one of the big gaps that needs to be fixed in the data management market is, when you integrate data from different sources, to be explicitly told, via different metrics (the classic ones we're used to being accuracy and completeness and things like this), what they are dealing with. I mean, just that simple fact of knowing, hey, we're dealing with 34% accurate data, and guess what, that's what we're pushing to the data warehouse to build reports that our management is making key decisions off. So I think, first of all, the gap is in knowing what quality of data you're dealing with. And I think the second piece is in facilitating the process around how you increase that. A lot of these things can often be fixed by normalizing values. You know, if I've got two different names for a company, but they're the same record, which one do you choose? Do we normalize to the value that's uppercase, or lowercase, or title case, or the one that has, you know, "Incorporated" at the end? And I think that part of the industry does need to get better.
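
As a small illustration of the kind of metric Tim has in mind, here is a sketch that computes per-field completeness over a handful of records; the fields, values, and null markers are invented for the example:

```python
records = [
    {"name": "Acme Inc", "email": "hi@acme.com"},
    {"name": "acme incorporated", "email": None},
    {"name": None, "email": None},
]

def completeness(rows, field):
    """Share of rows where the field holds a usable value."""
    filled = sum(1 for r in rows if r.get(field) not in (None, "", "n/a"))
    return filled / len(rows)

for field in ("name", "email"):
    print(field, f"{completeness(records, field):.0%}")  # name 67%, email 33%
```
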
Tobias Macey
0:53:20
All right. Well, thank you very much for taking the time today to join me and discuss your thoughts on eventual connectivity and some of the ways that it can augment or replace some of the ETL patterns that we have been working with to date. I appreciate your thoughts on that, and I hope you enjoy the rest of your day.
Tim Ward
0:53:37
Thanks, Tobias.

The Workflow Engine For Data Engineers And Data Scientists - Episode 86

Summary

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Prefect is and your motivation for creating it?
  • What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
  • In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
  • How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
  • How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
  • What was your decision making process when deciding to use Dask as your supported execution engine?
    • For tasks that require specific resources or dependencies how do you approach the idea of task affinity?
  • Does Prefect support managing tasks that bridge network boundaries?
  • What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
  • What are the limitations of the open source core as compared to the cloud offering that you are building?
  • What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
  • What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
  • When is Prefect the wrong choice?
  • In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
    • What are some best practices and industry trends that you are most excited by?
  • What do you have planned for the future of the Prefect project and company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Evolving An ETL Pipeline For Better Productivity - Episode 83

Summary

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allows his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service this is definitely worth listening to for some perspective.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
  • Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
    • What are your primary sources of data and what are the targets that you are loading them into?
  • What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
    • What were your criteria for your replacement technology and how did you gather and evaluate your options?
  • Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
    • What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
    • What were the big wins?
  • What was your evaluation framework for determining whether your re-engineering was successful?
  • Now that you are using DataCoral, how would you characterize your experience and that of your team?
    • If you have freed up time for your engineers, how are you allocating that spare capacity?
  • What do you hope to see from DataCoral in the future?
  • What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Build Your Data Analytics Like An Engineer - Episode 81

Summary

In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what DBT is and your motivation for creating it?
  • Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?
  • Can you talk through the workflow for someone using DBT?
  • One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?
  • The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?
    • Are these packages driven by Fishtown Analytics or the dbt community?
  • What are the limitations of modeling everything as a SELECT statement?
  • Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings? (See the short sketch after this list for a minimal illustration of Jinja-templated SQL.)
    • What are your thoughts on higher level approaches to SQL that compile down to the specific statements?
  • Can you explain how DBT is implemented and how the design has evolved since you first began working on it?
  • What are some of the features of DBT that are often overlooked which you find particularly useful?
  • What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?
  • What are the additional features that the commercial version of DBT provides?
  • What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?
  • When is it the wrong choice?
  • What do you have planned for the future of DBT?
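
dbt models are Jinja-templated SELECT statements that get compiled down to plain SQL before they run. As a minimal, hedged illustration of why that templating helps with reuse, the sketch below uses the jinja2 library directly; the template and table names are invented for the example, and dbt itself layers functions such as ref() and config() on top of plain Jinja.

    # A conceptual sketch of Jinja-templated SQL, in the spirit of dbt models.
    # The template and table names are made up for this illustration.
    from jinja2 import Template

    # A reusable rollup "model": the date column and source table are parameters.
    daily_rollup = Template("""
    select
        date_trunc('day', {{ date_column }}) as day,
        count(*) as row_count
    from {{ source_table }}
    group by 1
    """)

    # The same template compiles to two different, plain-SQL statements.
    print(daily_rollup.render(date_column="created_at", source_table="orders"))
    print(daily_rollup.render(date_column="signed_up_at", source_table="users"))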

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Serverless Data Pipelines On DataCoral - Episode 76

Summary

How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Raghu Murthy about DataCoral, a platform that provides a fully managed and secure data stack in your own cloud, delivering data to where you need it

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what DataCoral is and your motivation for founding it?
  • How does the data-centric approach of DataCoral differ from the way that other platforms think about processing information?
  • Can you describe how the DataCoral platform is designed and implemented, and how it has evolved since you first began working on it?
    • How does the concept of a data slice play into the overall architecture of your platform?
    • How do you manage transformations of data schemas and formats as they traverse different slices in your platform?
  • On your site it mentions that you have the ability to automatically adjust to changes in external APIs. Can you discuss how that manifests?
  • What has been your experience, both positive and negative, in building on top of serverless components?
  • Can you discuss the customer experience of onboarding onto DataCoral and how it differs between teams with existing data platforms and those starting greenfield projects?
  • What are some of the slices that have proven to be the most challenging to implement?
    • Are there any that you are currently building that you are most excited for?
  • How much effort do you anticipate if and when you begin to support other cloud providers?
  • When is DataCoral the wrong choice?
  • What do you have planned for the future of DataCoral, both from a technical and a business perspective?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Cleaning And Curating Open Data For Archaeology - Episode 68

Summary

Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face in scaling ETL processes that require domain-specific knowledge, and how the information contained in the connections that they expose is being used for interesting projects.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data

Interview

  • Introduction

  • How did you get involved in the area of data management?

    I did some database and GIS work for my dissertation in archaeology, back in the late 1990s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management.

  • Can you start by describing what Open Context is and how it started?

    Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books, and reports.

  • What are your protocols for determining which data sets you will work with?

    Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.

  • What are some of the challenges unique to research data?

    • What are some of the unique requirements for processing, publishing, and archiving research data?

      You have to work on a shoestring budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.

      Another issue is that it will take a long time to publish enough data to power the kinds of "meta-analyses" that draw upon many datasets. Lots of archaeological data describes very particular places and times, and because datasets can be so particularistic, finding data relevant to your interests can be hard. So we face a monumental task in supplying enough data to satisfy many, many particularistic interests.

  • How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?

    We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.

  • Can you describe the system architecture that you use for Open Context?

    Open Context is a Django (Python) application, with a Postgres database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
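
    For context, a minimal sketch of how such a stack might be wired together is shown below; the database name, Solr core, and query are placeholders for illustration, not Open Context's actual configuration.

    # Sketch of a Django + Postgres + Apache Solr stack with placeholder values.

    # settings.py: point Django's ORM at Postgres.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "opencontext_demo",  # hypothetical database name
            "HOST": "localhost",
            "PORT": "5432",
        }
    }

    # Elsewhere in the application: query the Solr index for search.
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/demo_core", timeout=10)
    for doc in solr.search("item_type:subjects", rows=10):
        print(doc.get("slug"), doc.get("label"))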

  • What is the process for cleaning and formatting the data that you host?

    • How much domain expertise is necessary to ensure proper conversion of the source data?

      That’s one of the bottlenecks. We have to do an ETL (extract, transform, load) pass on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back-and-forth conversations with the data creators.
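
      To make the shape of that work concrete, here is a small sketch of the kind of mechanical cleaning pass a submitted spreadsheet might get; the file and column names are invented, and the real process also involves domain review and conversation with the data creators.

      # Sketch of a mechanical cleaning pass on a submitted dataset.
      # File and column names are invented for illustration.
      import pandas as pd

      df = pd.read_csv("submitted_dataset.csv")

      # Normalize column names: strip whitespace, lowercase, use underscores.
      df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

      # Trim stray whitespace in text fields.
      text_cols = df.select_dtypes(include="object").columns
      df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

      # Coerce a date field, flagging unparseable values for follow-up
      # with the data creator rather than silently dropping them.
      df["excavation_date"] = pd.to_datetime(df["excavation_date"], errors="coerce")
      needs_review = df[df["excavation_date"].isna()]

      df.to_csv("cleaned_dataset.csv", index=False)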

    • Can you discuss the challenges that you face in maintaining a consistent ontology?

    • What pieces of metadata do you track for a given data set?

  • Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?

    • Can you walk through the lifecycle of a given data set?

  • Data archiving is a complicated and difficult endeavor, due to changing data formats and storage media as well as the need to reproduce the computing environments used to generate and process the data. Can you discuss the technical and procedural approaches that you take to address those challenges?

  • Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?

  • What are some of the most interesting uses you have seen of the data that is hosted on Open Context?

  • What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?

  • What are your goals for the future of Open Context?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

Summary

Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and offers some useful advice on build vs. buy for the components of a data platform.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of a data pipeline?
    • At what point in the life of a project or organization should you start thinking about building a pipeline?
  • In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
    • What metrics/use cases should you be optimizing for at this point?
  • What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
    • How do the design requirements for a data pipeline change as you reach this stage?
    • What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
  • What are some of the changes that are necessary as you move to a large scale data pipeline?
  • At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
  • In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
  • When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
  • Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
  • What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
    • How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
  • What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
  • What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
  • What are your plans for improving your current pipeline at Grubhub?
  • What are some references that you recommend for anyone who is designing a new data platform?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Summary

Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With its large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
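
For readers who have not yet touched Spark, the sketch below shows roughly what a minimal PySpark job looks like; the file path and column names are placeholders, and a real deployment would target a cluster rather than a local session.

    # A minimal PySpark sketch: read a CSV, aggregate, and write the result.
    # The path and column names are placeholders for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orders-rollup").getOrCreate()

    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

    # Count orders per day and sort chronologically.
    daily_totals = (
        orders.groupBy("order_date")
              .count()
              .orderBy("order_date")
    )

    daily_totals.show(10)
    daily_totals.write.mode("overwrite").parquet("daily_totals.parquet")

    spark.stop()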

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Spark is?
    • What are some of the main use cases for Spark?
    • What are some of the problems that Spark is uniquely suited to address?
    • Who uses Spark?
  • What are the tools offered to Spark users?
  • How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
  • For someone building on top of Spark what are the main software design paradigms?
    • How does the design of an application change as you go from a local development environment to a production cluster?
  • Once your application is written, what is involved in deploying it to a production environment?
  • What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
  • What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
  • What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
  • What are the limitations of the Spark programming model?
    • What are the cases where Spark is the wrong choice?
  • What was your motivation for writing a book about Spark?
    • Who is the target audience?
  • What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
  • What advice do you have for anyone who is considering or currently using Spark?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

  • Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA