Data Integration

Simplifying Data Integration Through Eventual Connectivity - Episode 91

Summary

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
  • What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
  • In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
    • How do different implementations of graph databases impact their viability for this use case?
  • Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
  • How much up-front modeling is necessary to make this a viable approach to data integration?
  • How do the volume and format of the source data impact the technology and architecture decisions that you would make?
  • What are the limitations or edge cases that you have found when using this pattern?
  • In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
  • What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to dataengineeringpodcast.com/angel to sign up today. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL. And just as a full disclosure, Tim is the CEO of CluedIn, who is a sponsor of the podcast. So Tim, can you just start by introducing yourself?
Tim Ward
0:02:09
Yeah, sure. My name is Tim Ward. As Tobias said, I'm the CEO of a data platform called CluedIn. I'm based out of Copenhagen, Denmark. I have with me my wife, my little boy Finn, and a little dog that looks like an Ewok, called Seymour.
Tobias Macey
0:02:29
And do you remember how you first got involved in the area of data management?
Unknown
0:02:32
Yeah, so I mean, I'm, I guess, a classically trained software engineer. I've been working in the software space for around 13, 14 years now, predominantly in the web space, but mostly for enterprise businesses. And around, I don't know, maybe six or seven years ago, I was given a project in the space of what's called multivariate testing. It's the idea that if you've got a website, maybe the homepage of a website, and we make some changes, or different variations, which variation works better for the amount of traffic that you're wanting to attract, or maybe the amount of purchases that the company makes on the website. So that was my first foray into, okay, this involves me having to capture data on analytics. That then took me down this rabbit hole of realizing, got it, I have to not only get the analytics from the website, but I need to correlate this against, you know, back office systems, CRM systems and, you know, ERP systems and PIM systems. And I kind of realized, oh my God, this becomes quite tricky with the integration piece. And once I went down that rabbit hole, I realized, for me to actually start doing something with this data, I need to clean it, I need to normalize it. And, you know, basically I got to this point where I realized, wow, data is kind of a hard thing to work with. It's not something you can pick up and just start getting value out of straight away. So that's kind of what led me onto the path, around four and a half, five years ago, of saying, you know what, I'm going to get into this data space. And ever since then, I've just enjoyed immensely being able to help large enterprises in becoming a lot more data driven.
Tobias Macey
0:04:31
And so to frame the discussion a bit, I'm wondering if you can just start by discussing some of the challenges and shortcomings that you have seen in the existing practices of ETL.
Unknown
0:04:42
Yes, sure. I mean, I guess I want to start by not trying to be that grumpy old man that's yelling at old technologies, and I'm not always this person. The one thing I've learned in my career is that it's very rare that a particular technology or approach is right or wrong. It's just right for the right use case. And I mean, you're also seeing a lot more patterns in integration emerge. Of course, we've got the ETL that's been around forever, you've got this ELT approach, which has been emerging over the last few years. And then you've kind of seen streaming platforms also take up the idea of joining streams of data instead of something that is kind of done up front. And, you know, to be honest, I've always wondered, with ETL, how on earth are people achieving this for an entire company? You know, like ETL for me has always been something that, if you've got two, three, four tools to integrate, it's a fantastic kind of approach, right? But, you know, now we're definitely dealing with a lot more data sources, and the demand for having free flowing data available is becoming much more apparent. And it got to the point where I thought, am I the stupid one? Like, if I have to use ETL to integrate data from multiple sources, as soon as we go over a certain limit of data sources, the problem just exponentially becomes a lot harder. And I think the thing that I found interesting as well with this ETL approach is that typically the data was processed through these classic, you know, designers, workflow DAGs, you know, directed acyclic graphs, and the output of this process was typically, oh, I'm going to store this in a relational database. And therefore, you know, I can understand why ETL existed. Yeah, if you know what you're going to do with your data after this ETL process, and classically you would go into something like a data warehouse, I can see why that existed. But I think there are just different types of demands in the market today. There's much more need for, you know, flexibility and access to data, and not necessarily data that's been modeled as rigidly as you do get in the kind of classical data warehouses. And I kind of thought, well, the relational database is not the only database available to us as engineers, and one of the ones that I've been focusing on for the last few years is the graph database. And when you think about it, most problems that we're trying to solve in the modeling world today, they are actually a network, they are a graph. They're not necessarily a relational or a kind of flat document store. So I thought, you know, this seems more like the right store to be able to model the data.
And I mean, I think the second thing was that, just from being hands on, I found that this ETL process meant that when I was trying to solve problems and integrate data up front, I had to know what were all of the business rules that dictated how the systems integrate, and what dictated clean data. And you're probably, Tobias, used to these ETL designers, where I get these built in functionalities to do things like, you know, trim whitespace and tokenize the text and things like that, and you think, yeah, but I need to know up front what is considered a bad ID or a bad record. You're probably also used to seeing, you know, we've got these IDs, and sometimes it's a beautiful looking ID and sometimes it's negative one, or NA, or, you know, placeholder, or hyphen, and you think, I've got to, up front in the ETL world, define what are all those possibilities before I run my ETL job. And I just found this quite rigid in its approach. And I think the kind of game changer for me was that when I was using ETL and these classic designers to integrate more than five systems, I realized how long it took up front, that I needed to go around the different parts of the business and have them explain, okay, so how does the Salesforce lead table connect to the Marketo lead table? Like, how does it do that? And then time after time, after, you know, weeks of investigation, I would realize, oh, I have to jump to, I don't know, the Exchange server or the Active Directory to get the information that I need to join those two systems. And it just resulted in this spaghetti of point to point integrations. And I think that's one of the key things that ETL suffers from, is that it puts us in an architectural design thinking pattern of, oh, how am I going to map systems point to point? And I can tell you, after working in this industry for five years so far, that systems don't naturally blend together point to point.
Tobias Macey
0:10:04
Yeah, your point too about the fact that you need to understand what are all the possible representations of a null value means that in order for a pipeline to be sufficiently robust, you have to have a fair amount of quality testing built in to make sure that any values that are coming through the system map to some of the existing values that you're expecting, and then be able to raise an alert when you see something that's outside of those bounds so that you can then go ahead and fix it. And then being able to have some sort of a dead letter queue or bad data queue for holding those records until you get to a point where you can reprocess them, and then being able to go through and back populate the data. So it definitely is something that requires a lot of engineering effort in order to be able to have something that is functional for all the potential values. And also there's the aspect of schema evolution and being able to figure out how to propagate that through your system and have your logic flexible enough to be able to handle different schema values for cases where you have data flowing through that is at the transition boundary between the two schemas. So certainly a complicated issue. And so you recently released a white paper discussing some of your ideas on this concept of eventual connectivity, and I'm wondering if you can describe your thoughts on that and touch on how it addresses some of the issues that you've seen with the more traditional ETL pattern.
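To make the kind of defensive handling described above a bit more concrete, here is a rough Python sketch, purely illustrative and not tied to any particular platform, that normalizes common sentinel "null" markers and routes anything unexpected to a bad-data queue for later reprocessing. The sentinel list and the in-memory queue are assumptions for the example.

```python
# Hypothetical sketch: normalize sentinel "null" values and quarantine bad records.
from collections import deque

SENTINEL_NULLS = {"-1", "na", "n/a", "", "placeholder", "-"}

bad_data_queue = deque()  # stands in for a real dead-letter queue

def normalize_id(raw, expected_ids):
    """Return a cleaned ID, None for known null markers, or quarantine the record."""
    value = str(raw).strip().lower()
    if value in SENTINEL_NULLS:
        return None                      # a recognized "no value" marker
    if value in expected_ids:
        return value                     # a value we already know about
    bad_data_queue.append(raw)           # unexpected value: hold for reprocessing
    return None

# Example usage
known = {"123", "456"}
print(normalize_id(" 123 ", known))      # -> "123"
print(normalize_id("NA", known))         # -> None
print(normalize_id("???", known))        # -> None, and "???" is queued
print(list(bad_data_queue))              # -> ["???"]
```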
Unknown
0:11:38
Yeah, sure. I mean, with this pattern that we've kind of named eventual connectivity, there are a couple of fundamental things to understand. First of all, it's a pattern that essentially embraces the idea that we can throw data into a store, and as we continue to throw in more data, records will find out themselves how to be merged. And it's the idea of being able to place records into this kind of central place, this central repository, with little hooks, flags that are indicating, hey, I'm a record, and here are my unique references. So, you know, obviously with the idea being that as we bring in more systems, those other records will say, hey, I actually have the same ID. Now, that might not happen up front. It might be after you've integrated systems one, two, three, four, five, six that systems two and four are able to say, hey, I now have the missing pieces to be able to merge our records. So in an eventual connectivity world, what this really brings in advantages is that, first of all, if I'm trying to integrate systems, I only need to take one system at a time. And I found it rare in the enterprise that I could find someone who understood the domain knowledge behind their Salesforce account, and their Oracle account, and their Marketo account. I would often run into someone who completely understood the business domain behind the Salesforce account. And the reason I'm using that as an example is because Salesforce is an example of a system where you can do anything in it. You can add objects that are, you know, animals or dinosaurs, not just the ones that are out of the box. I don't know who's selling to dinosaurs. But essentially, what this allows me to do is, when I walk into an integration job and the business says, hey, we have three systems, I say, got it. And if they say, oh, sorry, that was actually 300 systems, I go, got it, it makes no difference to me. It's only a time based thing; the complexity doesn't get more complex because of this type of pattern that we're taking. And I'll explain the pattern. Essentially, you can conceptualize it as we go through a system a record at a time, or an object at a time. Let's take something like leads or contacts. The pattern basically asks us to highlight what are unique references to that object. So in the case of a person, it might be something like a passport number, it might be, you know, a local personal identifier. In Denmark, we have what's called the CPR number, which is a unique reference to me; no one else in Denmark can have the same number. Then you get to things like emails, and what you discover pretty quickly in the enterprise data world is that email is in no way a unique identifier of an object, right? We can have group emails that refer to multiple different people, and, you know, not all systems will specify if this is a group email or if this is an email referring to an individual. So the pattern asks us, or dictates us, to mark those references as aliases, something that could allude to a unique identifier of an object. And then we get to the referential pieces. So imagine that we have a contact that's associated with a company. You could probably imagine that as a column in the contact table that's called company ID.
And the key thing with the eventual connectivity pattern is that, although I want to highlight that as a unique reference to another object, I don't want to tell the integration pattern where that object exists. I don't want to tell it that it's in the Salesforce organization table, because, to be honest, if that's a unique reference, that unique reference may exist in other systems. And so what this means is that I can take an individual system at a time and not have to have this standard point to point type of relationship between data. And if I was to highlight kind of three main wins that you get out of this, I think the first is that it's quite powerful to walk into a large business and say, hey, how many systems do you have? Well, we have 1000. And I think, good, when can we start? Now, if I was in the ETL approach, I would be thinking, oh God, can we actually honestly do this? Like,
0:16:40
as you probably know yourself, Tobias, often we go into projects with big smiling faces, and then when you see the data, you realize, oh, this is going to be a difficult project. So there's that advantage of being able to walk in and say, I don't care how many systems you have, it makes not a lot of complexity difference to me. I think the other piece is that the eventual connectivity pattern addresses this idea that you don't need to know all the business rules up front of how systems connect to each other, and what's considered bad data versus good data, and rather that, you know, we let things happen and we have a much more reactive approach to be able to rectify them. And I think this is more cognizant, or more representative, of the world we live in today. Companies are wanting more real time data to their consumers, or to the consumption technologies where we get the value, things like business intelligence, etc. And they're not willing to put up with these kind of rigid approaches of, oh, the ETL process has broken down, I need to go back to our design, I need to update that and run it through and make sure that we guarantee that, you know, the data is in the perfect order before we actually do the merging. I think the final thing that has become obvious, time after time where I've seen companies use this pattern, is that this eventual connectivity pattern will discover joins where it's really hard for you and me to sit down and figure out where these joins are. And I think it comes back to this core idea that systems don't connect well point to point. There's not always a nice, ubiquitous ID that we can use to just join two systems together. Often we have to jump in between different data sources to be able to wrangle this into a unified type of set of data. Now, at the same time, I can't deny that, you know, there's quite a lot of work that's going on in the field of, you know, ETL. You've got platforms like NiFi and Airflow, and you know what, those are still very valid. They're still, you know, very good at moving data around. They're fantastic at breaking down a workflow into these kind of discrete components that can, in some cases, play independently. I think that the eventual connectivity pattern for us, time after time, has allowed us to blend systems together without this overhead of complexity. And Tobias, there's not a big enough whiteboard in the world when it comes to integrating, you know, 50 systems. You just have to put yourself in that situation and realize, oh wow, the old ways of doing it are just not scaling.
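As a rough illustration of the pattern described here, the following Python sketch uses a hypothetical Record structure (not anything from CluedIn) to show records carrying their own unique references (entity codes) and softer aliases, and merging only when an entity code overlaps, regardless of which source arrives first.

```python
# Hypothetical sketch of "eventual connectivity": records merge whenever
# their unique references (entity codes) overlap, regardless of load order.
from dataclasses import dataclass, field

@dataclass
class Record:
    source: str
    entity_codes: set = field(default_factory=set)  # hard unique references
    aliases: set = field(default_factory=set)       # soft hints (emails, phones)
    data: dict = field(default_factory=dict)

def merge_if_connected(records):
    """Greedily merge records that share at least one entity code."""
    merged = []
    for rec in records:
        target = next((m for m in merged if m.entity_codes & rec.entity_codes), None)
        if target:
            target.entity_codes |= rec.entity_codes
            target.aliases |= rec.aliases
            target.data.update(rec.data)
        else:
            merged.append(rec)
    return merged

crm = Record("salesforce", {"passport:P123"}, {"email:group@acme.com"}, {"name": "Jane"})
erp = Record("sap", {"passport:P123", "cpr:555"}, set(), {"dept": "Sales"})
print(len(merge_if_connected([crm, erp])))  # -> 1: the two records found each other
```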
Tobias Macey
0:19:31
And as you're talking through this idea of eventual connectivity, I'm wondering how it ends up being materially different from a data lake, where you're able to just do the more ELT pattern of shipping all of the data into this repository without having to worry about modeling it up front and understanding what all the mappings are, and then doing some exploratory analysis after the fact to be able to then create all of these connection points between the different data sources and do whatever cleaning happens after the fact.
Unknown
0:20:03
Yeah, I mean, one thing I've gathered in my career as well is that, you know, something like an overall data pipeline for a business is going to be made up of so many different components. And in our world, in the eventual connectivity world, the lake still makes complete sense to have. I see the lake as a place to dump data where I can read it in a ubiquitous language; in most cases it's SQL that's exposed, and, you know, I don't know a single person in our industry that doesn't know SQL to some extent. So it is fantastic to have that lake there. Where I see the problem often evolving is that the lake is obviously kind of a place where we would typically store raw data. It's where we abstract away the complexity of, oh, now if I need data from a SharePoint site, I have to learn the SharePoint API. No, the lake is there to basically say, that's already been done, I'm going to give you SQL, that's the way that you're going to get this data. What I find, when I look at the companies that we work with, is that, yes, but there's a lot that needs to be done from the lake to where we can actually get the value. I think something like machine learning is a good example. Time after time we hear, and it's true, that machine learning doesn't really work that well if you're not working with good quality, well integrated data that is complete, that isn't full of, you know, nulls and empty columns and things like that. And what I found is that we went through this period in our industry where we said, okay, well, the lake is going to give the data science teams and the different teams direct access to the raw data. And what we found is that every time they tried to use that data, they went through the common practices of, now I need to blend it, now I need to catalog it, now I need to normalize it and clean it. And you could see that the eventual connectivity pattern is there to say, no, this is something that sits in between, after the lake, that matures the data to the point where it's already blended. And that's one of the biggest challenges I kind of see there, is that, you know, if I get a couple of different files out of the lake, and then I go to investigate how this joins together, I still have this experience of, oh, this doesn't easily blend together. So then I go on this exploratory, this discovery phase of, what other datasets do I have to use to string these two systems together? And at CluedIn we would just like to eliminate that.
Tobias Macey
0:22:46
So to make this a bit more concrete for people who are listening and wondering how they can put this pattern into effect in their own systems, can you talk through an example system architecture and data flow for a use case that you have done, or at least experimented with, to be able to put this into effect, and how the overall architecture plays together to make this a viable approach, and how those different connection points between the data systems end up manifesting?
Unknown
0:23:17
Yeah, definitely. So maybe it's good to use an example. Imagine you have three or four data sources that you need to blend. You need to ingest them, you then usually need to merge the records into kind of a flat, unified data set, and then you need to, you know, push this somewhere. So that might be a data warehouse, something like BigQuery or Redshift, etc. And the fact is that, you know, in today's world, that data also needs to be available for the data science team, and now it needs to be available for things like exploratory business intelligence. So when you're building your integrations, I think architecturally, from a modeling perspective, the three things that you need to think about are what we call entity codes, aliases, and edges, and those three pieces together are what we need to be able to map this properly into a graph store. And simply put, an entity code is kind of an absolute unique reference to an object, as I alluded to before, something like a passport number. And that's a unique reference to an object, though that by itself, just the passport number, doesn't mean that it's unique across all the systems that you have at your workplace.
0:24:42
So the other is aliases. Aliases are more like this: an email, a phone number, a nickname. They're all alluding to some type of overlap between these records, but they're not something that we can just honestly go ahead and do a hundred percent merge of records based off. Now, of course, for that you then need to investigate things like inference engines to build up, you know, confidence: how confident can I be that a person's nickname is unique in the context of the data that we've pulled in from these three or four data sources that I'm talking about. And then finally, the edges. They're placed essentially to be able to build referential integrity. But what I find architecturally is that when we're trying to solve data problems for companies, the majority of the time their model represents much more of a network than the classic relational store or columnar database or document store. And so when we look at the technology that's needed to, you know, support the system architecture, one of the key technologies at the heart of this is a graph database. And to be honest, it doesn't really matter which graph database you use, but what we have found important is that it needs to be a native graph store. There are triple stores out there, there are multi-model databases like Cosmos DB and SAP HANA, but what we found is that you really need a native graph to be able to do this properly. So the way that you can conceptualize the pattern is that every record that we pull in from a system, or that you import, will go into this network or graph as a node. And every entity code for that record, i.e. the unique ID, or multiple unique IDs of that record, will also go in as a node connected to that record. Now, every alias will go in as a property on that original node, because we probably want to do some processing later to figure out if we can turn that alias into one of these entity codes or unique references. And here's the interesting part, this is the part where the eventual connectivity pattern kicks in with the edges, i.e. if I was, you know, referencing a person to a company, that that person works at a company. Now, those edges are placed into the graph, and a node is created, but it's marked as hidden. We call those shadow nodes. So you can imagine, if we brought in a record on Barack Obama, and it had Barack's phone number, now, that's not a unique reference. But what we would do is we would create a node in the graph that's referring to that phone number, link it to Obama, but mark the phone number node as hidden; as I said before, we call those shadow nodes. And essentially, you can see that as one of these hooks that, ah, if I ever get other records that come in later that also have an alias or an entity code that overlaps, that's where I might need to start doing my merging work. And what we're hoping for, and this is what we see time after time as well, is that as we import system one's data, it'll start to come in, and you'll see a lot of nodes that are the shadow nodes, i.e. I have nothing to hook onto on the other side of this ID. And the analogy that we use for this shadow node is that, you know, records come in, and they're by default a clue. A clue is in no way factual; in no way do we have any other records correlating to the same values. And our goal, in this eventual connectivity pattern, is to turn clues into facts.
And what makes facts is records that have the same entity code existing across different systems. So the key architectural pattern here is that a graph store needs to be there to model our data, and here's one of the key reasons. If the landing zone of this integrated data was a relational database, I would need to have an upfront schema; I would need to specify how these objects connect to each other. What I've always found in the past is that when I need to do this, it becomes quite rigid. Now, I'm a strong believer that every database needs a schema at some point, or you can't scale these things. But what's nice about the graph is that one of the things graph databases have always done really well is flexible data modeling. There is not necessarily a more important object that exists within the graph structure; they're all equal in their complexity, but also in their importance. And you can really pick and choose the graph database that you want, but it's one of the keys to this architectural pattern.
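To make the shadow-node mechanics more concrete, here is a small sketch using the networkx library; the library choice and the node and edge attributes are assumptions for illustration, not CluedIn's actual engine. Each entity code becomes a node, codes that are only referenced by an edge are flagged as shadows, and a shadow becomes a fact once a record arrives that claims the same code.

```python
# Hypothetical sketch of the shadow-node idea using networkx (not CluedIn's engine).
import networkx as nx

g = nx.DiGraph()

def add_record(record_id, entity_codes, referenced_codes):
    """Add a record, its observed entity codes, and shadow hooks for references."""
    g.add_node(record_id, kind="record")
    for code in entity_codes:
        # An observed code: creating (or re-touching) it marks it as a fact.
        g.add_node(code, kind="entity_code", shadow=False)
        g.add_edge(record_id, code, rel="identified_by")
    for code in referenced_codes:
        if code not in g:
            # A hook: we reference this code but have never seen the record behind it.
            g.add_node(code, kind="entity_code", shadow=True)
        g.add_edge(record_id, code, rel="references")

# System 1: a contact referencing a company we have not ingested yet.
add_record("contact:42", {"email:jane@acme.com"}, {"company:123"})
print(g.nodes["company:123"]["shadow"])   # -> True: still a clue, not a fact

# System 4 arrives weeks later and identifies that company directly.
add_record("org:acme", {"company:123"}, set())
print(g.nodes["company:123"]["shadow"])   # -> False: the clue became a fact
```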
Tobias Macey
0:30:07
one of the things that you talked about in there is the fact that there's this flexible data model. And so I'm wondering what type of upfront modeling is necessary in order to be able to make this approach viable. I know you talked about the idea of the entity codes and the aliases, but for somebody who is taking a source data system and trying to load it into this graph database in order to be able to take advantage of this eventual connectivity pattern, what is the actual process of being able to load that information in and assign the appropriate attributes to the different records and to the different attributes in the record? And then also, I'm wondering if there are any limitations in terms of what the source format looks like, as far as the serialization format or the types of data that this approach is viable for?
Unknown
0:31:03
Sure. Good question. So I think the first thing is to identify that with the eventual connectivity pattern and modeling it in the graph, the key is that there will always be extra modeling that you do after this step. And the reason why is because, if you think about the data structures that we have as engineers, the network or the graph is the highest fidelity data structure we have. It's a higher, or more detailed, structure than a tree, it's more structured than a hierarchy or a relational store, and we definitely have more structure or fidelity than something like a document. With this in mind, we use eventual connectivity to solve this piece of integrating data from different systems and modeling it, but we know we will always do better modeling for the purpose fit case later. So it's worth highlighting that the value of the eventual connectivity pattern is that it makes the integration of data easier, but this will definitely not be the last modeling that you would do. And therefore this allows flexible modeling, because you always know, hey, if I'm trying to build a data warehouse based off the data that we've modeled in the graph, you're always going to do extra processes after it to model it into probably the relational store for a data warehouse, or a columnar store; you're going to model it purpose fit to solve that problem. However, if what you're trying to achieve with your data is flexible access to data to be able to feed off to other systems, you want the highest fidelity, and you want the highest flexibility in modeling. But the key is that if you were to drive your data warehouse directly off this graph, it would do a terrible job. That's not what the graph was purpose built for. The graph was always good at flexible data modeling, and it's always good at being able to join records very fast, I mean just as fast as doing an index lookup. That's how these native graph stores have been designed. And so it's important to highlight that really, there's not a lot of upfront modeling. Of course, we shouldn't do silly things, but I'll give you an example. If I was modeling a skill, a person, and a company, it's completely fine to have a graph where the skill points to the person, and the person points to the organization. And it's also okay to have the person point to the skill and the skill point to the organization. That's not as important. What's important at this stage is that the eventual connectivity pattern allows us to integrate data more easily. Now, when I get to the point where I want to do something with that data, I might find that, yes, I actually do need an organization table which has a foreign key to person, and then person has a foreign key to skill. And that's because, you know, that's typically what a data warehouse is built to do. It's to model the data perfectly, so if I have a billion rows of data, this report still runs fast. But we lose that kind of
0:34:34
that flexibility in the data modeling. Now, as for formats and things like that, what I found is that, to some degree, with the formatting and the source data, you could probably imagine the data is not created equally, right? Many different systems will allow you to do absolutely anything you want. And where the kind of ETL approach allows you, or kind of dictates, that you capture these exceptions up front, of, if I've got certain looking data coming in, how does it connect to the other systems, what eventual connectivity does is just catch them later in the process. And my thought on this is that, to be honest, you will never know all these business rules up front, and therefore let's embrace an integration pattern that says, hey, if the schema in the source or the format of the data changes, and you kind of alluded to this before as well, Tobias, it's okay, got it, I want to be alerted that there is an issue with deserializing this data, I want to start queuing up the data in a message queue or maybe a stream, and I want to be able to fix that schema in the platform and be able to say, got it, now that that's fixed, I'll continue on with serializing the things that will now serialize. And these kinds of things will happen all the time. I think I've referred to it before, and heard other people refer to it, as schema drift. And this will always happen, in source and in target. So what I found success with is embracing patterns where failure is going to happen all the time. And when we look at the ETL approach, it's much more of a, when things go wrong, everything stops, right? The different workflow stages that we've put into our kind of classical ETL designers, they all go red, red, red, red, I have no idea what to do, and I'm just going to kind of fall over. And so what we would rather have is a pattern that says, got it, the schema has changed, I'm going to queue up the data until the point where you've changed that schema. And when you put that in place, it'll also test it and say, yeah, that schema seems to be one I can serialize now, and I'll continue on. And what I find is that if you don't embrace this, you'll spend most of your time just reprocessing data through ETL.
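A bare-bones sketch of the queue-and-replay behaviour described here, assuming a simple in-memory queue standing in for a real message broker and a required-field list standing in for a real schema definition:

```python
# Hypothetical sketch: when deserialization fails, queue records instead of halting,
# then replay the backlog once the schema (here just a required-field list) is fixed.
import json
from collections import deque

REQUIRED_FIELDS = ["id", "email"]
backlog = deque()   # stands in for Kafka, SQS, or another real queue
processed = []

def ingest(raw):
    try:
        record = json.loads(raw)
        if not all(f in record for f in REQUIRED_FIELDS):
            raise ValueError("schema drift: missing required field")
        processed.append(record)
    except (json.JSONDecodeError, ValueError):
        backlog.append(raw)             # hold it and alert a human; don't stop the pipeline

def replay_backlog():
    """Re-run queued records after the schema definition has been corrected."""
    for _ in range(len(backlog)):
        ingest(backlog.popleft())

ingest('{"id": 1, "email": "a@b.com"}')          # fine
ingest('{"id": 2, "contact_email": "c@d.com"}')  # drifted: queued, not fatal
REQUIRED_FIELDS[:] = ["id"]                      # "fix" the schema expectation
replay_backlog()
print(len(processed), len(backlog))              # -> 2 0
```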
Tobias Macey
0:37:13
And so it seems that there actually is still a place for workflow engines or some measure of ETL, where you're extracting the data from the source systems, but rather than loading it into your data lake or your data warehouse, you're adding it to the graph store for then being able to pull these mappings out, and then also potentially going from the graph database, where you have joined the data together, and being able to pull it out from that using some sort of query to have the mapped data extracted, and then load that into your eventual target.
Unknown
0:37:49
I mean, what you've just described there is a workflow, and therefore, you know, the workflow systems still make sense. It's very logical to look at these workflows and say, oh, that happens, then that happens, then that happens. They completely still make sense, and in some cases I actually still use the ETL tools for very specific jobs. But what you can see is, if we were to use these kind of classical workflow systems, the eventual connectivity pattern, as you described, is just one step in that overall workflow. I think what I found over time is that we've been using these workflow systems to be able to join data, and I would actually rather throw that to an individual step called eventual connectivity, where it does the joining and things like that for me, and then continue on from there. Very similar to the kind of example you gave, and that I've also been mentioning here as well, there will always be things you do after the graph, and that is something you could easily push into one of these workflow designers. Now, as for an example of, you know, the times when our company still uses these tools with our customers, I think one of the ones that makes complete sense is IoT data. And it's mainly because it's not typical, at least in the cases that we've seen, that there's as much hassle in blending and cleaning that data. We see that more with, you know, operational data, things like transactions and, you know, customer data and customer records, which are typically quite hard to blend. But when it comes to IoT data, you know, if there's something wrong with the data, that it can't blend, it's often that, well, maybe it's a bad reader that we're, you know, reading off, instead of something that is actually dirty data. Now, of course, every now and then, if you've worked in that space, you'll realize that, you know, readers can lose a network connection and, you know, leave holes in the data. But I mean, eventual connectivity would not solve that either, right? And typically, in those cases, you'll do things like impute the values from historical and future data to fill in the gaps, and it's always a little bit of a guess; that's why it's called imputing it. But, to be honest, if it was my task to build a unified data set from across different sources, I would just choose this eventual connectivity pattern every single time. If it was a workflow of data processing where I know that data blends easily, then there's not a data quality issue, right? But where there is, where you need to jump across multiple different systems to merge data, I've just, time after time, found that, you know, these workflow systems reach their limit, where it just becomes too complex.
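As a side note on the gap-filling mentioned for IoT readings, imputation from neighbouring values often amounts to something as small as this pandas sketch; the tooling choice is an assumption for illustration, not something prescribed in the episode.

```python
# Hypothetical sketch: impute missing sensor readings from their neighbours.
import pandas as pd

readings = pd.Series([21.0, None, None, 24.0])   # two gaps in an hourly sensor feed
filled = readings.interpolate()                  # linear interpolation between neighbours
print(filled.tolist())                           # -> [21.0, 22.0, 23.0, 24.0]
```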
Tobias Macey
0:40:53
And for certain scales or varieties of data, I imagine that there are certain cases that come up when trying to load everything into the graph store. And so I'm wondering what you have run up against as far as limitations to this pattern, or at least alterations to the pattern to be able to handle some of these larger volume tasks.
Unknown
0:41:15
I think I'll start with this: the graph is notoriously hard to scale. In most of the graph databases that I've had experience with, you're essentially bound to one big graph, so there's no real idea of clustering these data stores into, you know, subgraphs that you could query across. So scaling that is actually quite hard to start with. But as for the limitations of the pattern itself, there are many. I mean, it starts with the fact that you need to be careful. I'll give you a good example: I've seen many companies that use this pattern, and they flag something like an email as unique, and then we realize later on, no, it's not, and we have merged records that are not duplicates. And this means, of course, that you need support in the platform that you're utilizing, the ability to split these records and fix them and reprocess them at a later point. But I mean, these are also things that would be very hard to pick up in ETL or ELT types of patterns. I think one of the other, you know, downsides of this approach is that up front you don't know how many records will join. Kind of like the name alludes to, eventually you'll get joins or connectivity, and you can think of it as, this pattern will decide how many records it will join for you, based off these entity codes or unique references, or the power of your inference engine when it comes to things that are a little bit fuzzy, a fuzzy ID to someone, things like, you know, phone numbers. The great thing about this is it also means that you don't need to choose what type of join you're doing. I mean, in the relational world you've got plenty of different types of joins, your inner joins, outer joins, left and right outer joins, things like this. In the graph, there's one join, right? And so with that pattern, you know, it's not like you can pick the wrong join to go with, there's only one type. So it really becomes useful when, no, I'm just trying to merge these records, I don't need to hand hold how the joins will happen. I think one of the other downsides that I've had experience with is that, let's just say you have, you know, systems one and two. What you'll often find is that when you integrate these two systems, you have a lot of these shadow nodes in the graph, i.e., sometimes we call them floating edges, where, hey, I've got a reference to a company with an ID of 123, but I've never found the record on the other side with the same ID. So, in fact, I'm storing lots of extra information that, you know what, I'm not actually utilizing. But I think the advantage is in saying, yeah, but you will integrate system four, you will integrate system five, where that data sits. And the value is that you don't need to tell the systems how they join, you just need to flag these unique references. And I think the kind of final limitation that I've found with this pattern is that you learn pretty quickly, as I alluded to before, that there are many records in your data sources where you think a column is unique, but it's not
Tim Ward
0:44:45
it might be unique in your system,
Unknown
0:44:47
i.e. in Salesforce, the ID is unique. But if you actually look across the other parts of the stack, you realize, no, there is another company in another system with a record called 123, and they have nothing to do with each other. And so, you know, these entity codes that I've been talking about, they're made up of multiple different parts: they're made up of the ID, 123; they're made up of a type, something like organization; and they're made up of a source of origin, you know, Salesforce account 456. And so what this does is it guarantees uniqueness if you add in two Salesforce accounts, or if you add in systems that have the same ID but it came from a different source. And as I said before, a good example would be the email. I mean, even at our company, we use GitHub Enterprise to store our source code, and, you know, we have notifications that our engineers get when, you know, there's pull requests and things like this. And GitHub actually identifies each employee as notifications@github.com; that's what that record sends us as its unique reference. And of course, if I marked this as a unique reference, all of those employee records, using this pattern, would merge. However, what I like about this approach is that, you know, at least I'm given the tools to rectify the bad data when I see it. And to be honest, if companies are wanting to become much more data driven, as we aim to help our customers with, I just believe that it means we have to start to shift, or learn to accept, that there's more risk that could happen. But is that risk of having data, you know, more readily available to the forefront worth more than the old approaches that we're taking?
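A tiny sketch of the composite entity code described here, with hypothetical field names; the point is simply that the raw ID on its own is never treated as globally unique.

```python
# Hypothetical sketch: an entity code combines type, source of origin, and raw ID,
# so the same raw ID coming from two systems never collides.
from typing import NamedTuple

class EntityCode(NamedTuple):
    entity_type: str   # e.g. "organization"
    origin: str        # e.g. "salesforce:account-456"
    value: str         # the raw ID, e.g. "123"

a = EntityCode("organization", "salesforce:account-456", "123")
b = EntityCode("organization", "hubspot:portal-789", "123")
print(a == b)   # -> False: same raw ID, different origins, so no accidental merge
```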
Tobias Macey
0:46:46
And for anybody who wants to dig deeper into this idea, or learn more about your thoughts on that, or some of the adjacent technologies, what are some of the resources that you recommend they look to?
Unknown
0:47:00
So I mean, I guess the first thing, and Tobias, you and I have talked about this before, is that I think the way to kind of learn more about it is to get in contact and kind of challenge us on this idea. I mean, when you see a technology and you're an engineer, and you go out and start using it, you have this tendency to kind of gain a bias around it, that, you know, time after time you see it working, and then you think, why is not everybody else doing this? And actually, the answer is quite clear. It's because, well, things like graph databases were not as ubiquitous as they are right now. You know, today you can get off the shelf, free graph databases to use, and even ten years ago this was just not the case; you would have to build these things yourself. So I think the first thing is, you know, you can get in touch with me at tiw@cluedin.com if you're just interested in challenging this design pattern, and really getting to the crux of, really, is this something that we can replace ETL with completely. I think the other thing is the white paper that you alluded to, that's available from our website. So you can always jump over to cluedin.com to read that white paper; it's completely open and free for everyone to read. And then we also have a couple of YouTube videos. If you just type in CluedIn, I'm sure you'll find them, and there we talk in depth about, you know, utilizing the graph to be able to merge different data sets, and we really go into depth. But I mean, I always like to talk to other engineers and have them challenge me on this, so feel free to get in touch. And I guess if you're wanting to learn much more, we also have our developer training that we give here at CluedIn, where, you know, we compare this pattern to other patterns that are out there, and you can get hands on experience with taking different data sources, trying the multiple different integration approaches that are out there, and really just seeing the one that works for you.
Tobias Macey
0:49:04
Is there anything else about the ideas of eventual connectivity or the ETL patterns that you have seen, or the overall space of data integration, that we didn't discuss yet that you'd like to cover before we close out the show?
Unknown
0:49:16
I think for me, I always like it when I have more engineering patterns and tools on my tool belt. So I think the thing for listeners to take from this is to use this as an extra little piece on your tool belt. If you find that you walk into, you know, a company that you're helping, and they say, hey, listen, we're really wanting to start to do things with our data, and they say, yeah, we've got 300 systems, and to be honest, I've been given the direction to kind of pull and wrangle this into something we can use, really think about this eventual connectivity pattern, really investigate it as a possible option. To implement the pattern yourself, and you'll be able to see it in the white paper, it's really not complex. Just like I said before, one of the keys is to just embrace maybe a new database family to be able to model your data. And yes, get in touch if you need any more information.
Tobias Macey
0:50:22
And one follow on from that, too, I think, is the idea of migrating from an existing ETL workflow into this eventual connectivity space. And it seems that the logical step would be to just replace your current target system with the graph database, and add in the mapping for the entity IDs and the aliases. And then you're sort of at least partway on your way to being able to take advantage of this, and then just adding a new ETL or workflow at the other end to pull the connected data out into what your original target systems were. Yeah, exactly.
Unknown
0:51:02
I mean, it's quite often we walk into a business and they've already got many years of business logic embedded into these ETL pipelines. And, you know, my idea on that is not to just throw these away; there's a lot of good stuff there. It's really just to complement it with this extra design pattern, which is probably a little bit better at the whole merging and deduplication of data.
Tobias Macey
0:51:29
All right. Well, for anybody who wants to get in touch with you, I'll add your email and whatever other contact information to the show notes, and I've also got a link to the white paper that you mentioned. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Unknown
0:51:49
Well, I would say, a little bit off topic, but I'm amazed at how many companies I walk into that don't know what is the quality of the data they are working with. So I think one of the big gaps that needs to be fixed in the data management market is, when you integrate data from different sources, to be explicitly told, via different metrics, what you are dealing with. I mean, the classic ones we're used to are accuracy and completeness and things like this, for businesses to know, what are they dealing with? Just that simple fact of knowing, hey, we're dealing with 34% accurate data, and guess what, that's what we're pushing to the data warehouse to build reports that our management is making key decisions off. So I think, first of all, the gap is in knowing what quality of data you're dealing with. And I think the second piece is in facilitating the process around how you increase that. A lot of these things can often be fixed by normalizing values. You know, if I've got two different names for a company, but for the same record, which one do you choose? And do we normalize to the value that's uppercase, or lowercase, or title case, or the one that has, you know, Incorporated at the end? And I think that part of the industry does need to get better.
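For the completeness metric mentioned here, the arithmetic is straightforward; a rough sketch over a couple of hypothetical records:

```python
# Hypothetical sketch: report column-level completeness for a batch of records.
records = [
    {"name": "Acme Inc", "email": "info@acme.com", "phone": None},
    {"name": "Acme Incorporated", "email": None, "phone": None},
]

def completeness(rows):
    """Fraction of non-empty values per column, across all rows."""
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if r.get(c)) / len(rows) for c in columns}

print(completeness(records))   # -> {'name': 1.0, 'email': 0.5, 'phone': 0.0}
```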
Tobias Macey
0:53:20
All right. Well, thank you very much for taking the time today to join me and discuss your thoughts on eventual connectivity and some of the ways that it can augment or replace some of the ETL patterns that we have been working with up to now. So I appreciate your thoughts on that, and I hope you enjoy the rest of your day.
Tim Ward
0:53:37
Thanks, Tobias.

The Workflow Engine For Data Engineers And Data Scientists - Episode 86

Summary

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Prefect is and your motivation for creating it?
  • What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
  • In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
  • How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
  • How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
  • What was your decision making process when deciding to use Dask as your supported execution engine?
    • For tasks that require specific resources or dependencies, how do you approach the idea of task affinity?
  • Does Prefect support managing tasks that bridge network boundaries?
  • What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
  • What are the limitations of the open source core as compared to the cloud offering that you are building?
  • What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
  • What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
  • When is Prefect the wrong choice?
  • In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
    • What are some best practices and industry trends that you are most excited by?
  • What do you have planned for the future of the Prefect project and company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building An Enterprise Data Fabric At CluedIn - Episode 74

Summary

Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to build a business aimed at providing a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, how they automate entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric

Interview

  • Introduction

  • How did you get involved in the area of data management?

  • Before we get started, can you share your definition of what a data fabric is?

  • Can you explain what CluedIn is and share the story of how it started?

    • Can you describe your ideal customer?
    • What are some of the primary ways that organizations are using CluedIn?
  • Can you give an overview of the system architecture that you have built and how it has evolved since you first began building it?

  • For a new customer of CluedIn, what is involved in the onboarding process?

  • What are some of the most challenging aspects of data integration?

    • What is your approach to managing the process of cleaning the data that you are ingesting?
      • How much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?
    • How do you preserve and expose data lineage/provenance to your customers?
  • How do you manage changes or breakage in the interfaces that you use for source or destination systems?

  • What are some of the signals that you monitor to ensure the continued healthy operation of your platform?

  • What are some of the most notable customer success stories that you have experienced?

    • Are there any notable failures that you have experienced, and if so, what were the lessons learned?
  • What are some cases where CluedIn is not the right choice?

  • What do you have planned for the future of CluedIn?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA