Building A Knowledge Graph Of Commercial Real Estate At Cherre - Episode 127

Summary

Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Cherre is and the role that data plays in the business?
  • What are the benefits of a knowledge graph for making real estate investment decisions?
  • What are the main ways that you and your customers are using the knowledge graph?
    • What are some of the challenges that you face in providing a usable interface for end-users to query the graph?
  • What technology are you using for storing and processing the graph?
    • What challenges do you face in scaling the complexity and analysis of the graph?
  • What are the main sources of data for the knowledge graph?
  • What are some of the ways that messiness manifests in the data that you are using to populate the graph?
    • How are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure?
    • How do you handle missing attributes or extra attributes in a given record?
  • How did you approach the process of determining an effective taxonomy for records in the graph?
  • What is involved in performing entity extraction on your data?
  • What are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data?
  • What are some of the near and medium term improvements that you have planned for your knowledge graph?
  • What advice do you have for anyone who is interested in building a knowledge graph of their own?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you get everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information. So John, can you start by introducing yourself?
John Maiden
0:00:59
Hi, I'm John Maiden. I'm a senior data scientist at Cherre. I work on adding insights and features that we can extract from the data we process for our customers.
Tobias Macey
0:01:07
And do you remember how you first got involved in the area of data management?
John Maiden
0:01:10
So my background is as a data scientist. I think most people say, oh, you focus on the science component of the job, but data is actually a very important component of what we do. You especially have to think about the types of data you have and the type of information it provides. That's always a big one: you can have lots of data, but if it doesn't really say much, then it's kind of useless. And then when you get into the field, people use "big data" a lot, but when you're working with terabytes of data, you have to really think hard about what information is in the data. How is it contained? Where does it go? Is it going to be something that you care about and find useful? How does it all connect together? Especially if you think about joining data, when you have terabytes of data, it becomes very hard. So data management is a big part of what data scientists need to think about and do.
Tobias Macey
0:01:53
You are currently working at Cherre. Can you give a bit of background on what it is that Cherre is building and the role that data plays within the business?
John Maiden
0:02:02
So Cherre provides data for our customers in the commercial real estate space. What we're doing is making sure that they have all the data they need to be able to execute and make the best decisions for their business. A lot of data tends to be very siloed. Real estate is one part of it, but a lot of industries have this problem too, where you have the data you have, and you know it's got value, but you don't really know how much value it's got until you realize that there's a whole universe of data out there that can also be connected to your data. So what Cherre does is help make those connections. We take a lot of amazing public data. I'm a big fan of New York City open data; they provide a lot of great information that we can use, and there are other cities that are also pursuing their own initiatives to provide great open data that can drive business. We connect that data, as well as third-party paid data and data partnership data that we pull in. There are a lot of other companies that provide a wide range of data that is useful for the commercial real estate space. So besides the usual tax and transaction information, you've also got demographic information, you've got transportation; there are just lots of different aspects. And our customers have many different ways of thinking about the data. Our customers can be brokers, they can be property developers, they can be insurers, they can be financials. And each one of them has different cares and interests in the data. So it's getting all of their interests and all their needs consolidated, putting that data together, as well as being able to combine their data into the pipeline and getting out of that silo. So not just saying, oh, the data you have is useful, but taking that data and then combining it across the wide variety of data available really makes a powerful offering to customers, who realize that they can get much more insight from the connected data.
Tobias Macey
0:03:41
And so for being able to connect all the data and perform some of the useful analyses for your customers, you have built it into a Knowledge Graph using some of those open datasets that you mentioned. And I know that you also purchase data sets from different brokers or real estate owners and I'm curious if you can just discuss a bit about the benefits that a Knowledge Graph provides in terms of being able to make informed investment decisions in the real estate space and some of the challenges that alternate representations of that data pose.
John Maiden
0:04:13
Yeah, so as you said, we get data from public sources, some of it we pay for, and some of it comes from a wide variety of data partners. A knowledge graph is a useful tool. There are a lot of aspects of data that fit the traditional database of a property. Let's think of it from the real estate space: you have a property, number of bedrooms, number of bathrooms, square footage. Those are all the traditional attributes that you collect, and that's very useful; it helps people get insights and information from the data they have. Knowledge graphs are a different way of organizing data. If you look online, there are a lot of descriptions of a knowledge graph. You can look at it from a scientific perspective or from a computational perspective, but at the end of the day, it's about relationships in your data. You're saying that A is connected to B, and this is how they're connected. Now to do that, you have to think about how your data is structured. It's about organizing the data in certain ways. It's also about what value you want to get from that data, because there are many different ways that data can be organized and connected. In our case, we're very interested in how properties are connected to people or corporations. So extracting and going through those data sets and organizing the data in that way is very useful, and that then allows us to build a lot of great products on top of the connected data, where the emphasis is on the connections. That's why we build a knowledge graph: to show how data is connected.
Tobias Macey
0:05:31
And then in terms of the end users of the Knowledge Graph and the analysis being performed. I'm wondering if you can give a bit of a flavor of the interactions and the types of questions that are being asked of that knowledge graph and some of the challenges that you face in terms of being able to expose that underlying graph in a way that's intuitive and easy to use.
John Maiden
0:05:52
Yeah, so for the knowledge graph that we have, we see it as an internal resource. It's something that we have internal uses for right now. A lot of what we're doing is querying it through databases; we've got some graph tools as well, but a lot of it is analysis. We've got a great data science and machine learning team, and we're spending a lot of time just analyzing the data and looking at what we want to get out of it. It's a jumping-off point for other products that we can build. I would say the visualization tools right now are complicated. Originally, we built a smaller knowledge graph just based on New York City data. When we did that, there were tens of millions of edges, and just visualizing that in a graph database was kind of hard, because we would say, okay, I want to look at a specific property, and it would have hundreds of connections based on how we collected the data. You can filter it down, you can try to emphasize the way that everything's related, but it's still a big data dump that makes it hard to know which of all these connections is actually relevant. When you get to the national data, there are billions of rows, so billions of edges, and with billions of edges it's even harder. I would love to say visualization is important, and it's something that helps us understand the data, but what we've been doing so far is organizing it as a graph with edges, doing a lot of analytics, and then doing machine learning on top of that to try to extract insight. So the knowledge graph by itself is useful; it allows us to see the data. But visualization gets very hard when you have hundreds of millions of nodes and billions of edges, and it becomes more about aggregate statistics. You have to think about it in terms of big data. If you're trying to use this for something, the knowledge graph should be a driver for your products. So for whatever you're building, what's the coverage you're expecting? What's the accuracy? You have to think about it in more quantitative terms. If you have a specific use case, like looking at a specific property, then visualizing it is great, but generally we're working with the data in aggregate and trying to figure out, across the nation, how this information helps us.
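As a rough illustration of "thinking in aggregates" rather than trying to visualize a graph this size, here is a minimal PySpark sketch that summarizes edge counts per relationship type and the degree distribution of a large edge table. The bucket path and column names (src, dst, relation) are assumptions for illustration, not Cherre's actual schema.

```python
# Summarize a graph that is too big to eyeball: counts per relation type and
# a degree-distribution summary (e.g. to spot a suspicious "John Doe" hub).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("graph-aggregates").getOrCreate()

# edges: one row per relationship, e.g. (property_id, owner_id, "OWNED_BY")
edges = spark.read.parquet("gs://example-bucket/knowledge-graph/edges/")

# Rough coverage check: how many edges of each type do we have?
edges.groupBy("relation").count().orderBy(F.desc("count")).show()

# Degree per node, counting both directions.
degrees = (
    edges.select(F.col("src").alias("node"))
    .union(edges.select(F.col("dst").alias("node")))
    .groupBy("node")
    .agg(F.count("*").alias("degree"))
)

# Aggregate statistics instead of a picture: average, median, and max degree.
degrees.select(
    F.mean("degree").alias("avg_degree"),
    F.expr("percentile_approx(degree, 0.5)").alias("median_degree"),
    F.max("degree").alias("max_degree"),
).show()
```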
Tobias Macey
0:07:56
And I know that in terms of graph representation, there have been a few different approaches to being able to actually store and analyze it, starting with RDF triples and using a triple store, or using a graph model that is stored within a relational database. And then there have also been engines that are specifically built for storing and processing graphs and graph algorithms, with Neo4j being the most notable, but also things like Dgraph or TigerGraph. And I'm curious what you're using as the actual storage engine for being able to hold this information and query it, and any challenges that you're facing in terms of being able to scale it to the volumes of data and the numbers of queries that you're trying to process.
John Maiden
0:08:39
Yeah, so the example I gave was when we did the New York City knowledge graph, which was the ramp-up to doing the national one. The New York City one was something that you could process on a single computer, and it had tens of millions of edges, but you could still do enough analysis. A lot of it was extracting data from multiple sources. That's the complexity: before you build a good knowledge graph, you just need the data to be in place in the first place. So a lot of our engineering effort, and we've got a very large dedicated engineering team, is focused on building out resilient, repeatable, scalable pipelines that pull in everybody's data in a consistent way, in a safe way. Once you get the data there, it's very easy for data scientists and machine learning engineers to go ahead and say, okay, these are the sources I need, I need to pull them out, I need to extract all the pairs. We didn't go through the traditional relationship extraction that you normally would do. For New York City, it was very straightforward to just pull the data; we were looking for properties to people and properties to addresses. We built the New York City one entirely on public data, because New York City provides a great set of data. So a bit of Python, and since we're a Google Cloud shop, BigQuery, got the data into the right format to start with. In terms of visualization, yes, it was very important, because this was our initial play. We wanted to figure out whether it made sense, and since New York City is where we're from, that at least was something we could focus on and evaluate clearly. We know New York City very well, so did we get this right or not? You can see the data right in front of you. So Python and Neo4j were great for the initial analysis. Scaling becomes very hard, I agree, and our emphasis is more on the analytics. For us, the knowledge graph is a driver for new products that go in front of our live customers, so being able to process a graph at scale, as well as being able to do analytics on it, is more important than being able to visualize it all. Our focus has been on building out a resilient pipeline, ingesting the data, and then putting everything in BigQuery, because obviously we've got billions of edges to work with. I'm a big fan of Spark, and that's what we've been using to drive it, especially GraphFrames, because it can ingest the data, it can quickly find the patterns we need, and it can do a lot of analysis quickly. I'm more on the machine learning and analytics side, and I think it's a very good general tool for performing bulk operations. Processing a couple of billion edges only takes a couple of hours on a decent-sized cluster, and that makes me very happy.
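To make the bulk-processing approach described here more concrete, below is a hedged sketch of loading extracted edge pairs into Spark GraphFrames and running connected components to group properties, people, and companies into ownership networks. The paths, column names, and cluster setup are assumptions, not Cherre's actual pipeline.

```python
# Build a GraphFrame from previously extracted (src, dst, relation) pairs and
# run a couple of the bulk operations mentioned above.
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires the graphframes Spark package

spark = SparkSession.builder.appName("cre-knowledge-graph").getOrCreate()
# connectedComponents() needs a checkpoint directory.
spark.sparkContext.setCheckpointDir("gs://example-bucket/checkpoints/")

# Edge pairs exported from the cleaned sources (e.g. out of BigQuery).
edges = spark.read.parquet("gs://example-bucket/kg/edge_pairs/")  # columns: src, dst, relation
vertices = (
    edges.selectExpr("src as id").union(edges.selectExpr("dst as id")).distinct()
)

g = GraphFrame(vertices, edges)

# Every property, person, and company reachable from one another ends up with
# the same component id, i.e. one candidate ownership network.
components = g.connectedComponents()

# Simple motif query: properties connected to a person through an intermediate company.
chains = g.find("(prop)-[r1]->(co); (co)-[r2]->(person)")
```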
Tobias Macey
0:11:11
And another issue of being able to build this type of data store and do the entity extraction and resolution and being able to establish the edges between the different nodes within the graph. There are a lot of challenges because of the fact that with pulling from multiple different data sources, I'm sure they don't all have the same representations and a sort of common schema or format. And I'm wondering what the main sources of messiness are for this data set and some of the approaches that you're using to be able to clean and normalize the information.
John Maiden
0:11:42
Yeah, so the data is extremely noisy in many different ways. I've worked with many large, noisy data sets. With this one, real estate in particular, a lot of the data is collected at the county level; that's just how things are done in the US, which means you have varying levels of information and formatting. Zoning codes vary; everyone's different. So if I say I've got a national database of real estate data, you know there's going to be tons of inconsistency. Some of this just involves having great analysts who really know their data. A good portion of the team has strong real estate knowledge and can quickly look at the data and say, this makes sense or this doesn't, this is relevant for what our customers care about or not. Having subject matter experts is always a critical part. If you're going to build a knowledge graph, the graph part's cool, but you also need the knowledge. So we go through the data determining what is relevant and what's not. With our data, again because we're coming in from multiple sources, sometimes you have to apply some very strong business logic. Missing data components are very hard, so finding ways to either fill them in, or at least provide enough information to complete the graph, is useful. Most of the time we have a very complete, comprehensive data set, but not everything is going to be perfect, especially when we're trying to combine the data. On the messiness side, just because we have providers, or we ourselves put the data together into a national database, doesn't really mean it's always connected. The biggest ones that drive us crazy are addresses and names. If anyone's worked with real estate data, you probably know that addresses are very complicated. You can have multiple addresses that all mean the same thing but are written differently. We used to be on Sixth Avenue; you can write that as Sixth Avenue, 6th Avenue, or Avenue of the Americas. Those are all valid addresses, but if you've got multiple data sets that each use a different spelling, those are going to lead you to different points, and you can't connect the data that way, especially if the wrong state was entered or the zip code was transposed. So address standardization is a big effort. To get the knowledge graph, putting the data together is not as important as being able to clean it up well, and a big effort of ours has been address standardization: making sure that all the addresses we get from all the different sources match together. That's a big lift in itself, and it's a service that we provide. We use it internally to clean our data, and we provide it to our customers to allow them to connect their data. Entity resolution is another big one. Buildings can have multiple addresses. What most people see on a building is probably the mailing address, but with a range of addresses for different buildings, you also have to put in time and effort, both from a data science and an engineering perspective, to resolve all the different data sets.
I mean, our current office has a street address and an avenue address, which means that depending on which data set you're using, you would have one data set pointing to our street address and another to our avenue address, and you've got to resolve both of those so they actually meet in the middle. And the last one that's tricky is names. Everyone thinks that names should be easy, but every data set has a different way of formatting names, especially when we're trying to connect them across disparate data sets. One data set might have "Maiden, John", another might have "John Maiden", and a third will have just "John". Are these all the same Johns? Is there a different John out there? If there was a typo and someone had "John Q. Maiden" versus "John W. Maiden", how do you decide whether these are the same person and the middle initial isn't important, or whether they are distinct people and you have to keep them separate? So name resolution is very important. The other problem, and this ties back to putting everything into a graph, is that certain names are very common. There are definitely many John Does in New York City, maybe not John Does per se, but there are a lot of people with very common names. If you look at the data, you could say this one guy owns 20% of New York City real estate, and that's not really the case. If you put everything into a graph and start looking at all of the different networks, you could say, well, obviously all of these connections go to this one person, this one name, John Doe owns all these properties. But if I look at it from a graph perspective, I really see disparate networks. So there is a back and forth, and building a knowledge graph is very much an iterative process. There's cleaning the data as well as you can, there's putting everything into a knowledge graph format, and then there's looking at how the data actually connects together and doing further cleaning and iteration, because you're never going to get anything perfect on the first try. You need clean data to build the graph, but once you build the graph, you can still see how noisy your data is, and you can do further rounds of iteration. Being able to do that is very important. A knowledge graph is not just about getting the data in one place; it's also recognizing all the effort and time that goes into getting it there in the first place and making those connections. And for us, address and name standardization are very important to making it as powerful and as useful as we can.
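Here is a toy example of the address and name standardization problem described above, using only the Python standard library. Cherre's production service is certainly far more sophisticated; the point is only to show the idea of mapping the many written forms of an address or name onto one canonical key before building edges. The synonym table and name-key rules are illustrative assumptions.

```python
import re

# Illustrative only: a real system would use a much larger dictionary plus
# parsing and geocoding, not a handful of substitutions.
STREET_SYNONYMS = {
    "ave": "avenue",
    "st": "street",
    "6th": "sixth",
    "avenue of the americas": "sixth avenue",
}

def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations/synonyms."""
    text = re.sub(r"[.,#]", " ", raw.lower())
    text = re.sub(r"\s+", " ", text).strip()
    for alias, canonical in STREET_SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(alias)}\b", canonical, text)
    return text

def name_key(raw: str) -> tuple:
    """Collapse 'Maiden, John', 'John Maiden', and 'John W. Maiden' toward a
    comparable (first, last) key. Middle initials are deliberately dropped,
    which is exactly the kind of ambiguity discussed above."""
    parts = [p.strip(".").lower() for p in re.split(r"[,\s]+", raw) if p.strip()]
    if "," in raw:                      # "Maiden, John": last name written first
        last, first = parts[0], parts[1]
    else:                               # "John W. Maiden": first ... last
        first, last = parts[0], parts[-1]
    return (first, last)

# "6th Ave." and "Avenue of the Americas" now share one canonical form,
# and the two name spellings share one key.
assert normalize_address("6th Ave.") == normalize_address("Avenue of the Americas")
assert name_key("Maiden, John") == name_key("John W. Maiden")
```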
Tobias Macey
0:16:39
Another element of the entity resolution is things like businesses that have multiple "doing business as" names, and I'm curious if you try to collapse those into a single entity as well, or do you keep them as distinct entities and just represent them with the specific business name that is on the different titles and deeds?
John Maiden
0:16:57
We try to resolve them back to a parent company. One of the big use cases for the knowledge graph at the moment is owner unmasking. A lot of commercial real estate properties, for multiple reasons, tend to be owned by separate LLCs, especially with some of the big players. If you have 123 Main Street and 124 Main Street and they're both owned by the same company, the owners on the tax rolls will be 123 Main Street LLC and 124 Main Street LLC. It becomes very hard because, even if you know personally that these are owned by the same owner, there's no clear way to connect them back. And our customers care about who the true owner is. So that's one of the use cases for the knowledge graph: getting all the data together so that we can make the connections, going from a property through all these other dots all the way back to a true owner, getting behind and away from the LLC and all the intermediaries. Another part of name resolution, which I think you touched on, is that it's not always one to one. A lot of the data sources we're working from don't just say "John Maiden" or "John W. Maiden". They'll say something like "Bank of America on behalf of the John Maiden Trust as represented by John W. Maiden". So it's not just about cleaning up the names, it's also about doing multiple entity extraction and prioritization. In that case, the entity that you care about would be something like the John Maiden Trust, or specifically John Maiden; you wouldn't care about the bank. There's a lot of logic. It's not purely about making sure that "John Maiden" shows up in the correct format. It's also being able to recognize what the true entities are. Is this a person? Is this a corporation? Is this a trust? And for the LLCs particularly, is this really the end point of the graph, or is there a further connection to follow where this actually resolves up to big player number one?
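Below is a small, purely illustrative sketch of the owner-unmasking traversal described here: starting from a property, follow ownership edges past intermediate LLCs until a non-intermediary entity is reached. The edge list and the rule for deciding what counts as an intermediary are made up for the example.

```python
from collections import defaultdict

# Hypothetical edges: (child entity, parent/owner entity)
OWNERSHIP_EDGES = [
    ("123 Main Street", "123 Main Street LLC"),
    ("124 Main Street", "124 Main Street LLC"),
    ("123 Main Street LLC", "Big Player Holdings"),
    ("124 Main Street LLC", "Big Player Holdings"),
]

owner_of = defaultdict(list)
for child, parent in OWNERSHIP_EDGES:
    owner_of[child].append(parent)

def is_intermediary(name: str) -> bool:
    # Real logic would rely on entity-type classification; this is a crude LLC check.
    return name.upper().endswith("LLC")

def true_owners(start: str, max_hops: int = 5) -> set:
    """Walk up the ownership chain, skipping past intermediary entities."""
    frontier, owners = {start}, set()
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for parent in owner_of.get(node, []):
                if is_intermediary(parent):
                    next_frontier.add(parent)   # keep climbing past the LLC
                else:
                    owners.add(parent)          # found a candidate true owner
        frontier = next_frontier
        if not frontier:
            break
    return owners

print(true_owners("123 Main Street"))  # {'Big Player Holdings'}
```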
Tobias Macey
0:18:47
And another element of challenge in terms of the consistency of the data is, as you mentioned, you're pulling data from multiple different counties and municipalities and states, and they're all going to have different systems that they're storing that information in, each of which has its own restrictions on field inputs. And one of the common challenges in basically any input system is, as you said, name resolution and the name formats that are allowed. So in one case it might be first name and last name only, in which case the middle name, or a hyphenated last name, gets all stuck into the same field. Or you have different cultures that have different mechanisms or different standards in terms of what the family name is versus the personal name. I'm also sure that there are a lot of cases you run into of Unicode names being merged into some sort of ASCII representation. And I'm curious if there are any particular horror stories that come to mind in terms of the challenges of being able to clean up some of the specifics of name or entities, and I'm sure also with some of the businesses' property information as well.
John Maiden
0:19:50
Yeah, generally I think name resolution can be very tricky, especially since there's no one consistent way to look at names. Some locations are better than others in terms of tracking information, so sometimes you have to live with some ambiguity. It's better to be slightly more ambiguous than to join things together incorrectly. Going back to the example: if I had a John Q. Maiden and a John W. Maiden and I really couldn't tell whether they were separate people or not, maybe it's a little bit better to keep both names. For our customers, particularly brokers who care about owner unmasking, a broker says, I found a property, I love this property, I think my customers really want to buy it, but I don't know who to contact to make an offer. As long as they have the contact information, they don't particularly care whether it's John Q. or John W. Maiden. Now, that doesn't mean we're done iterating, and the more information we pull in, the more it helps. The more different data sources you can add, the better; triangulation is important. The New York City graph was built entirely with public data, but we had such a rich set of data across tax information, transaction information, and building information. There are so many rich data sources that it helps narrow things down just from the data itself. It's a balance, right? If you've got great data sources, then you don't have to worry as much about using data science or some of the really cutting-edge NLP techniques. But that also assumes that you've got New York City, where the data is very consistent and homogeneous in the way it's presented. As you mentioned, once you get outside of the city and have to go across the country, where you have thousands of different regimes when it comes to data collection, that's where you want to get as much good data as possible. But then you have to start thinking about getting as many smart people as you can into a room to tackle the difficult problems of getting the data there, getting it on time, and making sure it's in reliable pipelines, and also rolling out a lot of cutting-edge NLP to try to analyze and resolve as much as you can, with the assumption that you're never going to be 100% perfect. As long as you get the customer to where they need to be, if you give them actionable data, I think that's the most important thing: can you give them data that's actionable? If so, that's what they care about.
Tobias Macey
0:21:58
Another aspect of messiness in the data, given the different regimes you're getting the information from, is things like missing attributes or extra attributes, where one record might have the name, the property name, and the geographic location, and another might just have the name and the property name, and you have to do your own figuring out of where it is in terms of latitude and longitude. And I'm wondering what your strategy is for being able to handle that inconsistency in the availability of specific attributes, and the approach that you've taken to determining an effective taxonomy for representing all of these different attributes of the data within the graph.
John Maiden
0:22:37
Well, I would say that, depending on the use case, sometimes you don't need that complicated a knowledge graph. The knowledge graph itself is primarily about relationships. If you care about A connecting to B, that a property connects to this person or this company or something like that, sometimes that's just enough, and knowledge graphs are important from the connection perspective. For some of the other use cases that we care about, we do actually care about a lot of the building features and a lot of information that we can gain from adding in other data sets. Sometimes you just have to think about what data is critical. You'd like to have as much data as possible, but sometimes you have to make a sacrifice: coverage is more important than accuracy in certain cases. And then you have to pivot to what story you want to tell. You've got the data, the knowledge graph is there, and you're kind of limited by the quality of your data. You can take a couple of different routes. One might be, I really want to tell story A, and I really need this other data set that will help me get there, and if it's something that's obtainable, then that's an effort we have to put in to make sure we can actually communicate what we need to communicate with the data. If it turns out it's just not there, sometimes there are things that can be implied or extracted from other data sources that might be tangential. Or it might be that this use case is a great idea, something we'd love to tackle, but it's not practical with the current state of the knowledge graph. It's also iterative. Once you start building knowledge graphs, there are certain things you can learn from them that you can then feed back into the knowledge graph itself. For example, once you've done owner unmasking, you can learn about portfolio ownership across multiple locations, and that becomes an attribute or information you can use to drive further insights. And if you have enough data, if there's a small percentage where you're missing coverage, you can potentially impute it or work around the fact that you don't have perfect coverage around the country. Especially nationally, that's a lot of different data points you can use to fill in the blanks.
Tobias Macey
0:24:36
Another aspect of data quality is the question of freshness and being able to understand how recently the data was acquired. From that, you also need to be able to go back to the source and say, at what point was this data set generated at the source, to be able to ensure that you have up-to-date information, because particularly with real estate, properties might change hands somewhat rapidly, and you want to make sure that you're contacting the appropriate owners. And I'm wondering what your approach is there in terms of being able to ensure that freshness of data, and the data lineage tracking to know how accurate and how up to date that information is.
John Maiden
0:25:10
Well, it's a combination of business and technology. On the business side, it's identifying the partners, whether we're buying the data, obtaining it, or working through a data partnership: finding the companies that give us the freshest data on a repeatable basis and that also give us the best coverage. Having been in the data space for a long time, you might not have one partner that gives you all the coverage you need. Maybe they're really good in one vertical but not another, and so you might have to do some piecemeal work to create the best data set from as many different components as possible. And if you can find a partner that gives you everything you need, pursue them; make sure that they're part of your platform. On the technology side, keeping it fresh is about investing in engineering and technology, making sure you've got a strong team that's empowered to build really good quality, scalable pipelines. Some of the data we have is small, but a lot of it is terabytes of data that gets processed, and it needs to be ingested automatically, without bugs, as a repeatable process. Data freshness is also about quality control. So it's finding the data you need and making sure that it can be delivered to customers, but also putting in place monitoring and other control processes to know what's going on with the data. Are we seeing the same data being refreshed again and again? Is it delivering cleanly? Are we getting the coverage that we expect? Is this the data we expected to see on a regular basis? There are a lot of moving components to guaranteeing freshness.
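As a hedged sketch of the kind of monitoring described here, the snippet below runs a few freshness and coverage checks against a BigQuery table after each load and fails loudly if anything looks off. The project, dataset, table, column names, and thresholds are all hypothetical.

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Assumed table and columns; a real check suite would cover every feed.
CHECKS_SQL = """
    SELECT
      COUNT(*)                    AS row_count,
      MAX(recorded_date)          AS newest_record,
      COUNT(DISTINCT county_fips) AS counties_covered
    FROM `example-project.real_estate.tax_assessments_latest`
"""

row = list(client.query(CHECKS_SQL).result())[0]

problems = []
if row.row_count < 90_000_000:                     # illustrative expected volume
    problems.append(f"row count too low: {row.row_count}")
if row.newest_record < datetime.date.today() - datetime.timedelta(days=45):
    problems.append(f"data looks stale: newest record {row.newest_record}")
if row.counties_covered < 3_000:                   # rough US county coverage
    problems.append(f"coverage dropped: {row.counties_covered} counties")

if problems:
    raise RuntimeError("freshness/coverage checks failed: " + "; ".join(problems))
```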
Tobias Macey
0:26:35
From the perspective of leveraging the Knowledge Graph for doing analysis, what have you found to be some of the most interesting or unexpected questions that you've been able to answer or particularly notable insights that you've been able to gain by querying the graph and being able to do some exploration of it?
John Maiden
0:26:52
So the knowledge graph in itself, I think, is always cool. I like collecting the data in the first place. Sometimes people say, oh John, you're a data scientist, you've got to do something with it, but let's recognize the amount of time and effort that's put in; there's a lot of technology work that goes into just getting a large knowledge graph that anyone can see is useful. It's very easy to collect data, put it in one place, and say, oh, we've got our database, we've got tons of data, and we can do amazing stuff with it. The amount of time and effort just to collect it, as well as clean and process it, a lot of engineering work and a lot of data science and machine learning effort to standardize and clean the data; in itself, just getting a knowledge graph that works is always something that makes me excited. But once we start collecting the data, I'm always excited, especially when it comes to big data: how do you actually verify big graphs? How do you verify that you did something right? When we're doing owner unmasking, we've got hundreds of millions of properties to check. You can't check every one individually, and you can't guarantee that you got 100% right. So it's looking at the top results and saying, yep, I definitely got this. I know that's kind of a large-scale answer, but generally it's: can you build something at scale and get it to work? And then when you see the results coming out of it and it's exactly what you expect to see, that generally makes me happy. But it's also being able to put in specific properties. We're based in New York, so we start querying all the big, famous properties: yep, we got this one right, the building we're in, we got that right, the building next door, we got that right. Sometimes it's the small things. Even if those aren't necessarily important to our customers, it's being able to get the big picture as well as some of the smaller resolution, and seeing connections that you wouldn't have seen otherwise. I think that's the big one. With traditional data sources, you say, okay, I'm just going to put all this data into a database, do a join of table A and table B, and then I'll see that this property is obviously owned by that person, because that's how the data connects. It's not as straightforward with a graph; it really can be multiple hops, or some type of aggregate information on the data. And when you're seeing results that come from two or three hops out and they're accurate, like when I get a result quickly from owner unmasking and I then sit down and look through a couple of different sources to find the connections, and it totally holds up, that's a really powerful graph, and that's what makes me happy.
Tobias Macey
0:29:19
It seems, too, that because of the fact that there is such a dearth of information available to people in the real estate space, there is probably going to be a bit of an increased tolerance for some level of ambiguity or uncertainty in the data, just by virtue of the fact that they're already getting access to more information than they would have had otherwise. And I'm curious what your thoughts are on that level of tolerance within the real estate industry, and some of the challenges that people in other verticals might face if they're trying to leverage a knowledge graph and need to account for some of these sources of uncertainty.
John Maiden
0:29:56
Right. So going back to my earlier comment about data being siloed: most companies usually have really great, insightful data, but it tends to cover one very small part of the market. If you've got a broker or a property developer, maybe they only work in one neighborhood. They care about this neighborhood, they care about this type of vertical; that's their strength, and that's great for them. But the question is, if you think about expanding that information, if you just get a little bit more information, what can you do with it? Could you potentially go to a neighborhood you've heard about, but don't really know that much about, and be able to find great properties there that also get you what you need? I think it gives you the strength to expand your horizons and look further beyond what you currently know, because there's so much more data out there. It's amazing the types of data we can collect, but it's also about validating a lot of the insights that they already have, because they only have one small piece of the puzzle. It's being able to see that the data has multiple aspects. It's not just about sales per square foot or lease per square foot; it's about demographics, it's about how the neighborhood is changing. Looking at the data over time is not something that's always available to a lot of these players. Having time information, being able to look at potential future prospects, neighborhoods or properties that aren't there yet but probably would be worth your time in six months to a year. I think that's a very important thing: once you open your mind to a lot of the other options, if you expose yourself to different sets of data, you can make your business a lot stronger. We recently had a hackathon, and we had great cross-functional teams doing a lot of great projects. A lot of what we did was trying to investigate some of the common assumptions in commercial real estate: what are millennials trying to do? How do we determine what an up-and-coming neighborhood is? Beyond the traditional gut feeling, and I know there are a lot of data-driven players in the space, given the data that we have and the insights we can generate, within the company we had just tons of great ideas. And that's from within the data that we can see. I'm sure that our customers will see even more than we can if they take their data, put it in one place, and expose themselves to a lot of different ways of viewing the data and seeing it over time.
Tobias Macey
0:32:12
And as far as your experience and your team's experience of working with all of this data and the disparities in it, and being able to build a useful knowledge graph out of it and build products based on that, what have you found to be some of the most interesting or unexpected or challenging aspects of that work?
John Maiden
0:32:30
From a data perspective, building a knowledge graph is generally about trying to determine what the relevant data is. I think you asked earlier about taxonomy. Originally, when we started to build the knowledge graph, our taxonomy was just focused on: we have properties, we have addresses, we have people, and we have companies. That's a very stark first version: build it out and see what happens. As we started building and getting feedback, both internally and externally, we realized that defining someone as a company is important, but our customers do care a little bit more about fine-grained information. They care whether an owner is an educational institution, they care whether it's a government owner; there's a lot of government-owned property, and those aren't necessarily going to be on the market. So adding in some richness is very important, and getting feedback was a big thing. Starting simple and adding feedback along the way is important. The challenges were getting it to scale. With billions of edges, Google Cloud can handle a lot of this data very cleanly, but getting it into Spark, getting it processed, and making sure it doesn't blow up is another matter. If anyone's used Spark, that's a very common problem: just because the code looks clean and you know exactly what it should be doing doesn't mean you're not going to get lots of weirdness with memory errors and the like. So scalability is always important. We've got a couple billion edges and the process now runs with no problem, but eventually we'll add in additional data sets and we're going to have many more edges. So on the technology side, scalability is always important, making sure that we can add in additional features and add in more. And definitely, I put in a lot of late nights, and the team has also worked a lot on this, to make sure that we could build something with national coverage. National is big, and scalability, repeatability, and making sure that it fits in a production system are a big challenge for us.
Tobias Macey
0:34:18
Looking to the near and medium term, what are some of the improvements or enhancements that you have planned for the actual content of the knowledge graph itself, or the pipeline and tooling that you have to be able to build and power the graph?
John Maiden
0:34:32
So on the data side, it's to add in additional data sets. I think we've got a good mix of data so far, focusing on a lot of real estate data, the traditional tax and transaction information. To be able to do entity resolution, you definitely need a good source of information about corporations, so those are always good additions, and then just getting more data that's relevant to the space. There are additional data sets we've identified that we think are critical to connecting the dots. As I said, the more data sets you can use to triangulate or supplement each other, the better. Especially when you're doing a graph, it's more about the relationships, not just the relationship one thing has to another, but the interconnectedness. If a property is connected to an address and connected to a person, and that person and the property are also similarly connected, that makes for a much stronger connection. You can say, I've got this relationship that I know is solid, as opposed to the random noise your data sources might sometimes have. So this also ties back to your question about noise: the more data sets you can use to empower the existing data you have, the better, especially with a graph structure, because you want lots of overlapping data sources. Anything we can identify that makes those connections stronger, or highlights specific connections over others, is what we're really looking for on the data side. On the technology side: productionization, productionization, productionization. As a data scientist, my big focus has always been on making processes that I can run scalably and repeatably, not manually. So anything we can do to make sure that everything runs smoothly: we can ingest all the data sources we need, extract all the pairs, clean up the addresses and the names, build the graph, run the jobs, and then have this process run on a regular basis given all the data that updates. That's my big focus. The data science is cool, I love doing data science, and I love the implications it has for customers, but making sure that it runs consistently and repeatedly is what makes me happy.
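Airflow is mentioned later in the conversation as the team's pipeline tool, so here is a hedged sketch of what wiring the steps just described (ingest, extract pairs, standardize, build graph, publish) into a scheduled DAG could look like. The task bodies, schedule, and DAG name are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; each would call the real pipeline code.
def ingest_sources(): ...   # pull county, city, and partner feeds
def extract_pairs(): ...    # emit (src, dst, relation) edge pairs
def standardize(): ...      # clean addresses and names
def build_graph(): ...      # assemble vertices/edges, run the Spark jobs
def publish(): ...          # push results to downstream products

with DAG(
    dag_id="knowledge_graph_refresh",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@weekly",   # rebuilt on a regular cadence as sources update
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="ingest_sources", python_callable=ingest_sources)
    t2 = PythonOperator(task_id="extract_pairs", python_callable=extract_pairs)
    t3 = PythonOperator(task_id="standardize", python_callable=standardize)
    t4 = PythonOperator(task_id="build_graph", python_callable=build_graph)
    t5 = PythonOperator(task_id="publish", python_callable=publish)

    t1 >> t2 >> t3 >> t4 >> t5
```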
Tobias Macey
0:36:32
And for anybody who is interested in building a knowledge graph of their own, or in the early phases of that process, what are some of the pieces of advice that you have or any useful references that you can point them to?
John Maiden
0:36:44
So for building a knowledge graph, there are a couple of good resources. Generally, if you look online, you're going to find lots of them. But before you go into building a knowledge graph, you have to ask yourself what you need to do, because sometimes traditional databases work very well for the types of problems you want to solve. You have to be able to say: I'm building a knowledge graph because I care about how data is connected, I care about the relationships between data, and I'm looking at more than one hop away. If it's simply "A is connected to B", that's something you can get from a traditional database structure. It's more about how nodes connect to each other, how they connect to their neighborhoods, and the structure of the neighborhood. You have to think about multiple connections and how each of these connections interplays with the others. So the first question you should ask is: what's my business use case, and what am I trying to accomplish? If emphasizing data relationships is very important, then that's a sign that you need a knowledge graph to build and use. From there, the next step is to think about whether the knowledge graph is the end in itself, or whether you need to build things from it. Sometimes having the data in one place is all you need to accomplish your goals. You've got the data, you can see it, you can visualize it; perfect, you're good. So think about where the end product is. For us, the knowledge graph is a resource that we build off of, so when we think about how the knowledge graph is built and processed, we think about it in terms of our end products. How do we support owner unmasking? How do we support the other goals that we have for this dataset? So first is the business use case. Then, once you have a clear business use case in mind, think about where this is going, how you construct the data, and what data you need. Starting from the business use case, take a step back and ask: to get there, are there specific data sets I need to build off of or ingest? Are those things that can easily be joined together, or do I have to put in additional processing, time, and effort? Is this something I can build or buy? All of this comes before the modeling. I'm sure if you're a data scientist listening, you're asking, but what about the model? Think about the process and what you're trying to accomplish. Getting there is very important, but starting from the end goal is critical; otherwise you have a giant collection of data and everyone says, oh, that's really nice, but what does it do? So business use case first, then step back and think about what data you need to support it. And then you start to get into: okay, now we definitely need this, let's look at the technical aspects. In that case, you have to ask, how am I going to support this? Is this going to be a small amount of data or is it going to be big? If I start building this, is it something I can easily put into something like Neo4j, or one of the other great competitors out there? Is this something I need to visualize, or is this something I need to process?
Potentially, there are some great Python packages out there. I'm a big fan of NetworkX, so if it's small enough, is this something that I could run in NetworkX to get what I need from it? If I need to do this at scale, then how do I process it? I'm a big Spark user, so I default to Spark when it comes to processing large graphs, and I know there are other great packages out there too. So understanding Spark and GraphFrames are also great technical resources. Depending on your use case, you might want to think about getting into more of the machine learning aspects. Are you going to be doing graph analysis on this? Are you looking for connectivity? Then maybe you want to find a good book on the basics of network structure and how a graph gets analyzed. If you're trying to get more complicated, are you starting to think about deep learning? Do you want to do things like graph embeddings? Do you need graph deep learning models? Those are the things you have to think about: how far down the rabbit hole do I need to go to get my business accomplishment done? So it's iterative: start with the business case, think about how to get the data and analyze it, and then decide whether you can do this with basic analytics, or whether you have to start thinking about much more complicated packages and going through some of the graph literature. If you're interested, there are good papers out there. I saw a good paper, and I'll probably get you a link later, that took a very comprehensive "everything to know about knowledge graphs" approach. It's a published paper, so it will give you a very deep perspective, from a computer science perspective, on what knowledge graphs can do. But generally it's: where does this need to go? How do I accomplish it? And then, from there, the engineering and machine learning challenges that will get me to my goal.
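For a small graph at the scale of the New York City prototype mentioned earlier, a first pass really can live entirely in NetworkX before reaching for Spark or a graph database. A minimal sketch, assuming an illustrative edge CSV with src, dst, and relation columns:

```python
import csv
import networkx as nx

G = nx.Graph()
with open("nyc_edges.csv", newline="") as f:          # hypothetical export of edge pairs
    for row in csv.DictReader(f):
        G.add_edge(row["src"], row["dst"], relation=row["relation"])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Ownership networks: connected components, largest first.
components = sorted(nx.connected_components(G), key=len, reverse=True)
print("largest network has", len(components[0]), "nodes")

# Everything within two hops of a specific property, i.e. the "neighborhood"
# you would otherwise try to eyeball in a visualization tool.
if "123 Main Street" in G:
    neighborhood = nx.ego_graph(G, "123 Main Street", radius=2)
    print(sorted(neighborhood.nodes()))
```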
Tobias Macey
0:41:09
Are there any other aspects of knowledge graphs in particular, or the work that you're doing at Cherre, or the challenges that you're facing, that we didn't discuss that you'd like to cover before we close out the show?
John Maiden
0:41:19
I would just say that building a knowledge graph is a big team effort. There are a lot of different moving parts, and it's a mix: there's the challenge on the business side of identifying what data is relevant and useful for our customers, there's the challenge of having a strong engineering team to build out the pipelines and get them running in a repeatable, scalable process, and you also need a good data science team that is willing to tackle big problems, particularly at scale. At the end of the day, gathering the data is important, but it is very important to also understand the data. I'm a big fan of understanding. There are some people who say, data is data, I'll throw it into a big model and the model will figure it all out. But understanding where the data is coming from, understanding how useful it is, understanding which features or which sources are more relevant than others, and having good business knowledge is always very important. So when you're trying to build this and you're trying to do knowledge graphs, or work with data in general, whether at a company or on your own, you want to make sure that you have a good mix of technology as well as business expertise to really make it a powerful product.
Tobias Macey
0:42:23
For anybody who wants to follow along with you or get in touch about the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
John Maiden
0:42:37
So I see a lot of tools where I think everyone's focused on how the data moves. There are really good tools for tracking data. As a company, we use Airflow for our pipelines and we use dbt to manage a lot of our SQL scripts. There's a lot of great tooling for visualizing how the data is flowing. So data flows are very well understood, and how the data is connected through a pipeline is very well understood. I would say, as a whole through the industry, there tends to be a gap around what the data is. Over the years, I've worked with a lot of large engineering teams where I've said, okay, we need to find data that can do X, Y, Z, because this is the business need, and they say, well, this is the data diagram. And I ask, okay, so which table gives us what we need? And they're not really sure; they know it's somewhere in this big mess, somewhere in here. So I would say it's about understanding the business value of the data, because collecting the data is important, but so is understanding where it came from, how it was used, what its limitations were, and, going back to data freshness, how often it's updated and who the final users were, which also ties back into what the use case was. Being able to attach business meaning and understanding to data sets is something where there's a hole in terms of data management. Most of the time the focus is on data as an object, where data is really more of a living, breathing thing, and I think there should be better tooling and focus on providing systems that allow you to recognize that as well.
Tobias Macey
0:44:00
Yeah, it's definitely something that I can agree with: in engineering, it's all too easy to lose sight of the actual business value and business purpose of what it is that you're doing. So I appreciate your insight on that being a continued problem with the actual data assets that we're working with. Thank you very much for your time today. I appreciate all the expertise that you're bringing, and I definitely wish you the best of luck on your work at Cherre. I hope you enjoy the rest of your day.
John Maiden
0:44:25
Thank you very much for your time as well. I enjoyed talking to you.
Tobias Macey
0:44:33
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.