Most of the time when you think about a data pipeline or ETL job, what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.
Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.
RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.
- How did you get involved in the area of data management? Started as a physicist and evolved into data science.
- Can you start by giving a brief recap of what Cherre is and the types of data that you deal with? Cherre is a company that connects data. We’re not primarily a data vendor, in that we don’t sell data. We help companies connect and make sense of their data. The real estate market is historically closed, gut-led, and behind on tech.
- What are the biggest challenges that you deal with in your role when working with real estate data? Lack of a standard domain model in real estate. Ontology: what is a property? Each data source thinks about properties in a very different way, yielding data that looks similar but differs substantially. Quality: even if the datasets are talking about the same thing, there are different levels of accuracy and freshness. Hierarchy: when is one source better than another?
- What are the teams and systems that rely on address information? Any company that needs to clean or organize (make sense of) its data needs to identify people, companies, and properties. Our clients use address resolution in multiple ways: via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it.
- Can you give an example for the problems involved in entity resolution?
Known entity example: the Empire State Building.
To resolve addresses in a way that makes sense for the client, you need to capture the real-world entities: lots, buildings, and units.
- Identify the type of the object (lot, building, unit)
- Tag the object with all the relevant addresses
- Relations to other objects (lot, building, unit)
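The three steps above (type the object, tag it with every relevant address, relate it to other objects) can be pictured with a small data model. This is only an illustrative sketch, not Cherre's actual schema: the class and function names, the identifiers, and the unit address are all hypothetical, and the street addresses are used purely as sample values.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class EntityType(Enum):
    LOT = "lot"
    BUILDING = "building"
    UNIT = "unit"


@dataclass
class PropertyEntity:
    entity_id: str                       # hypothetical persistent identifier
    entity_type: EntityType              # step 1: type of the object
    addresses: Set[str] = field(default_factory=set)  # step 2: all known addresses
    parent_id: Optional[str] = None      # step 3: relation (unit -> building -> lot)


def build_index(entities: List[PropertyEntity]) -> Dict[str, PropertyEntity]:
    """Map every tagged address to the entity that carries it."""
    index = {}
    for entity in entities:
        for address in entity.addresses:
            index[address] = entity
    return index


def ancestors(entity: PropertyEntity, by_id: Dict[str, PropertyEntity]) -> List[PropertyEntity]:
    """Walk parent links upward, e.g. unit -> building -> lot."""
    chain = []
    current = entity
    while current.parent_id is not None:
        current = by_id[current.parent_id]
        chain.append(current)
    return chain


# Sample data: one building reachable through two different street addresses.
lot = PropertyEntity("lot-1", EntityType.LOT)
building = PropertyEntity(
    "bldg-1", EntityType.BUILDING, {"350 5TH AVE", "20 W 34TH ST"}, parent_id="lot-1"
)
unit = PropertyEntity(
    "unit-1", EntityType.UNIT, {"350 5TH AVE STE 100"}, parent_id="bldg-1"  # made-up suite
)

entities = [lot, building, unit]
by_id = {e.entity_id: e for e in entities}
index = build_index(entities)
```

Looking up either `index["350 5TH AVE"]` or `index["20 W 34TH ST"]` returns the same building object, and `ancestors(unit, by_id)` walks from the unit up through the building to the lot.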
- What are some examples of the kinds of edge cases or messiness that you encounter in addresses? First class is string problems. Second class is component problems. Third class is geocoding.
- I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved?
What is the need for the service? The main requirement here is connecting an address to a lot, building, or unit with latitude and longitude coordinates.
- How were you satisfying this requirement previously? Before we built our model and dedicated service, we had a basic prototype, used only in the pipeline, that handled only NYC addresses.
- What were the motivations for designing and implementing this as a service? The need to expand nationwide and to deal with client queries in real time.
- What are some of the other data sources that you rely on to be able to perform this normalization and resolution? Lot data, building data, unit data, footprints, and address point datasets.
- What challenges do you face in managing these other sources of information? Accuracy, hierarchy, standardization, a unified solution, persistent IDs and primary keys.
- Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it? String cleaning, parsing and tokenizing, standardizing, and matching.
- What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion? Our named entity solution with connection to knowledge graph and owner unmasking.
- What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system? Scaling: the NYC geocode example. The NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure.
- Now that you have this system running in production, if you were to start over today what would you do differently? A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes to, or completely replace, any given part of it without breaking anything client-facing.
- What are some of the other projects that you are excited to work on going forward? Named entity resolution and Knowledge Graph
- From your perspective, what is the biggest gap in the tooling or technology for data management today? BigQuery is a huge asset, in particular its UDFs, but they don’t support API calls or Python scripts.
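The request lifecycle mentioned in the interview notes (string cleaning, parse and tokenize, standardize, match) can be sketched as a chain of small functions. The episode describes an NLP model behind an API; the version below swaps in simple rule-based stand-ins just to show the shape of the flow. Every function name, the abbreviation tables, and the `known` lookup are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny lookup tables; a real system needs far more.
DIRECTIONS = {"WEST": "W", "EAST": "E", "NORTH": "N", "SOUTH": "S"}
ORDINALS = {"FIFTH": "5TH"}
SUFFIXES = {"AVENUE": "AVE", "STREET": "ST"}


def clean(raw):
    """String cleaning: uppercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", raw.upper())
    return re.sub(r"\s+", " ", text).strip()


def parse(cleaned):
    """Parse and tokenize: split into a house number and street tokens."""
    tokens = cleaned.split()
    if not tokens or not tokens[0].isdigit():
        return None  # no leading house number, give up in this sketch
    return {"number": tokens[0], "street": tokens[1:]}


def standardize(parsed):
    """Standardize: rewrite each street token to one canonical spelling."""
    street = []
    for token in parsed["street"]:
        for table in (DIRECTIONS, ORDINALS, SUFFIXES):
            token = table.get(token, token)
        street.append(token)
    return f'{parsed["number"]} {" ".join(street)}'


def resolve(raw, known):
    """Match: look the standardized string up among known entities."""
    parsed = parse(clean(raw))
    if parsed is None:
        return None
    return known.get(standardize(parsed))


# Hypothetical reference data: canonical address -> entity id.
known = {"350 5TH AVE": "bldg-1", "20 W 34TH ST": "bldg-1"}
```

With this table, `resolve("350 fifth  avenue.", known)` and `resolve("20 West 34th Street", known)` both land on the same entity id, while an input with no house number returns `None`. Exact dictionary lookup is the simplest possible matcher; fuzzy or model-based matching would slot into the same final step.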
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email firstname.lastname@example.org with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Knowledge Graph
- Entity Resolution
- NLP == Natural Language Processing