Summary
The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Mark Hookey about Demyst Data, a platform for operationalizing external data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Demyst is and the story behind it?
- What are the services and systems that you provide for organizations to incorporate external sources in their data workflows?
- Who are your target customers?
- What are some examples of data sets that an organization might want to use in their analytics?
- How are these different from SaaS data that an organization might integrate with tools such as Stitch and Fivetran?
- What are some of the challenges that are introduced by working with these external data sets?
- If an organization isn’t using Demyst what are some of the technical and organizational systems that they will need to build and manage?
- Can you describe how the Demyst platform is architected?
- What have been the most complex or difficult engineering challenges that you have dealt with while building Demyst?
- Given the wide variance in the systems that your customers are running, what are some strategies that you have used to provide flexible APIs for accessing the underlying information?
- What is the process for you to identify and onboard a new data source in your platform?
- What are some of the additional analytical systems that you have to run to manage your business (e.g. usage metering and analytics, etc.)?
- What are the most interesting, innovative, or unexpected ways that you have seen Demyst used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Demyst?
- When is Demyst the wrong choice?
- What do you have planned for the future of Demyst?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm interviewing Mark Hookey about Demyst, a platform for operationalizing external data. So, Mark, can you start by introducing yourself? Hi, Tobias. Great to meet you.
[00:01:42] Unknown:
I'm Mark Hookey. I'm the founder and CEO of Demyst Data. And my background is at the intersection of data and analytics. I've been in this space for 20-some years. I'm very interested in the world of operationalizing external data into client workflows.
[00:02:00] Unknown:
And do you remember how you first got involved in data management?
[00:02:03] Unknown:
Earlier in my career I was more focused on analytics, and then we had a business that was purchased by a bureau called ChoicePoint that became part of LexisNexis, and it had a team of really incredible data scientists who had that typical challenge of just not even being able to get their hands on data. So I became very interested in the challenge of why it is, when we're supposedly awash with data, that people who can do great things can't tap into it. I started to look at the problem underneath the analytics and have spent a lot of time, in the context of Demyst, researching and building out technologies around that.
[00:02:49] Unknown:
And so in terms of what you're building at Demyst, can you give a bit of an overview about what it is that you're building there and some of the story behind it? So Demyst is building an external data ops platform.
[00:03:01] Unknown:
And what we help enterprises do is discover, curate, contract with, and operationalize external data. An example of that is a bank that needs to verify consumers or run financial crime checks; it typically has to work with an external ecosystem of commercial data providers and open data providers to find out whether people are who they say they are, whether they have a job, whether they're a real person. And the world of external data vendors is very fragmented, and they all have idiosyncratic interfaces and different contracting approaches. And big enterprises, banks, insurers, and others have very, very high friction in integrating with and deploying those data sources. So we're building Demyst as a 1 stop shop to do that, under 1 API and 1 contract.
We've been in business for 11 years, and we help some of the world's leading banks, insurers, fintechs, insurtechs, and we tap into what we believe is the richest ecosystem of information to help solve relevant use cases.
[00:04:17] Unknown:
As far as the services and systems that you're building out to be able to provide these external data sources to your customers, I'm wondering what are some of the capabilities that are necessary for being able to collect these various data sources, get them cleaned up, and presentable for the various organizations to be able to consume and incorporate into their own analytics workflows?
[00:04:42] Unknown:
Well, there's upstream, the sources themselves, and there's downstream, which is where we egress our data. Upstream, we've built our own technology platform that integrates with data sources' own APIs. Everybody's got a slightly different API. They've got different schemas. They've got different types. Some of them have batch data access, streaming, synchronous APIs, asynchronous APIs, consent based access, the whole gamut. So we have thousands and thousands of connectors that we've built 1 at a time into our platform. Downstream, we allow our clients to access the data in the systems that they already use. There's CRMs, Salesforce, Dynamics, where people are looking to tap into the value of data in those systems. There's API gateway technology, like MuleSoft and Apigee and other B2B data gateways.
There's data warehousing technologies, Snowflake, Redshift on AWS, data lakes in S3, and there's decision engines: in banking, it's systems like Experian's PowerCurve technology, or in insurance it might be a policy management system. So there's a variety of systems downstream as well, and we have adapters to allow people to pull the data. So we provide the harmonization and standard schema and then provide adapters into downstream systems.
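To make the harmonization idea concrete, here is a minimal Python sketch of two upstream connectors mapping idiosyncratic vendor payloads into one explicitly typed standard record. The field names, vendor formats, and StandardBusinessRecord type are illustrative assumptions, not Demyst's actual schema or connector code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandardBusinessRecord:
    """Hypothetical harmonized schema: every field has an explicit, declared type."""
    business_name: str
    street: Optional[str]
    city: Optional[str]
    country_code: Optional[str]
    employee_count: Optional[int]


def from_vendor_a(payload: dict) -> StandardBusinessRecord:
    # Vendor A returns a flat JSON object with its own field names.
    return StandardBusinessRecord(
        business_name=payload["biz_nm"].strip(),
        street=payload.get("addr1"),
        city=payload.get("addr_city"),
        country_code=payload.get("ctry"),
        employee_count=int(payload["emp_cnt"]) if payload.get("emp_cnt") else None,
    )


def from_vendor_b(payload: dict) -> StandardBusinessRecord:
    # Vendor B nests the address and reports employees as a string range like "10-49".
    addr = payload.get("address", {})
    emp = payload.get("employees")
    return StandardBusinessRecord(
        business_name=payload["name"].strip(),
        street=addr.get("line_1"),
        city=addr.get("town"),
        country_code=addr.get("iso_country"),
        employee_count=int(emp.split("-")[0]) if emp else None,
    )
```

Each upstream connector owns the mapping from its vendor's quirks into the shared record, so every downstream adapter (warehouse table, API response, CRM field mapping) only ever has to understand one schema.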
[00:06:07] Unknown:
In terms of the customers that you're working with, I'm wondering if you can give a sense of the sort of types of industries that they're working in or the types of verticals or problems that they're trying to solve when they're incorporating these different external data sources and when it's not sufficient to use the data sources that are already internal to an organization?
[00:06:26] Unknown:
So external data is really, by definition, when an enterprise is solving a problem where the outside world knows more about the customer than the enterprise does. So things like customer acquisition and onboarding, where you're new to a bank, new to an insurer, new to a telco, where you don't have a rich, long history with that enterprise. They're the sorts of problems we tend to focus on. Now almost all large enterprises are using external data already. They're working with 1 or 2 credit bureaus. They're working with various other external data providers already. Some problems, though, by definition, also require a myriad of external sources. They require things like waterfalls and comparisons across external data sources.
Some problems, for example, are more dynamic, and they require the constant evolution of more and more and more data sources. So we tend to get involved in those sorts of problems as well. We're operating in a lot of different verticals and operating with very, very large global banks and insurers as well as scaling startups that are Fintechs and InsurTechs, working with travel and tourism. We're working with professional services firms and solution providers. So there's a whole range of different applications, but broadly, the domain of target customer is somebody who needs to operationalize and use external data to understand their customer, and that's broadly customers that are new to that institution or where they're trying to harness external information to service that customer that they don't have in their 4 walls. And what's really interesting, 1 of the things that's helping us scale is that the internal data engineering stack has matured a lot in the last few years, and that's allowed us to work with customers that have got their internal data into the right place and into the right gateways and technologies and catalogs and data lakes. And we've effectively bolted onto that and said to customers, we'll work within that stack. Now that you've got your internal data house in order, let's help to organize external data into the same infrastructure. Because as far as internal consumers and internal data engineers and data scientists at an enterprise are concerned, they don't necessarily care whether it's internal or external data. They just want the stack to work with both. But the external data has to be conformed into that internal state.
[00:08:46] Unknown:
And as far as these external data sources, I'm wondering what are some examples of the types of datasets that you're working with and providing to these different organizations and some of the potential origins of those datasets, whether it's something that you have to collect on your own and aggregate or if you are taking maybe, you know, publicly available government data but coalescing it into a form that's easier to work with, and just some of the ways that these types of data are useful despite the fact that they originate from outside an organization's boundaries?
[00:09:17] Unknown:
So there are some sources. If folks wanna take a look at demyst.com, people can sign up and have a look at the sources. Some of them are government, as you mentioned, things like business registries. You know, when was the small business or large business registered, and who are the directors of that business? There's also commercial data providers, like the bureaus themselves, Equifax, Experian, and many, many other smaller commercial data providers that have different pockets of information that have been contributed or aggregated over time that includes customer information, business information, property and address information.
So for example, in the insurance space, we partner with a lot of data aggregators that pull together things like court records and building permits to understand whether homes have been renovated and what types of roofs they have, and do they have swimming pools in the backyard and motor homes and security systems and various things that property insurers need to understand. These are often data sources that enterprises already ask customers about through manual workflows, but by going through these routes, processes can be automated. Another domain of external data sources is consent based sensitive information, such as people's credit reports or businesses' transactional information that comes from accounting systems such as Xero or Intuit, where the consumer or the business is granting consent to our clients to access that information, and we, on behalf of the client, safely and securely pull that information through.
There's quite a lot more than that. It's a very, very broad ecosystem out there, but information broadly falls into those buckets: raw source of truth information such as government data, aggregated information that is curated by commercial data providers, and consent based customer information.
[00:11:10] Unknown:
As far as the data sources that many sort of newer organizations are dealing with, the kind typified by the so called modern data stack, coming from a lot of these different SaaS tools and being integrated into their data warehouses with tools like Stitch or Fivetran: how do these external datasets differ from those types of information that people are accustomed to working with, and what are some of the additional pieces of contextual data and metadata that you need to be able to propagate and provide for people as they're starting to integrate these into their analytical workflows?
[00:11:46] Unknown:
Internal data integration tools such as Stitch and Fivetran pull together internal data. If you're a bank, they pull together things like, when did I use the ATM, and what products do I have, and how much money do I have, and what are my expenses? We also help organizations use those same kinds of integration technologies to add in additional contextual information about the customer that comes from outside those 4 walls. So things like, okay, is there a third party way to verify that I actually have a job? Am I on a watch list or a sanction list, a government OFAC list, or am I on a fraud list of somebody who has red flag patterns of interacting with other institutions?
There's also, for marketing use cases, things such as demographic information, so segment based information on people's profiles that is similar to the world of demographic information that's used in marketing analytics. It's still integrated inside the enterprise, but it's more context about the customer that comes from outside of the 4 walls versus the interactions of the customer with the enterprise directly.
[00:12:57] Unknown:
And when you're talking about things like the demographic information and OFAC lists, and pulling in these external data sources to enrich the data that they might already have about their customers or to gather more contextual data about the environment that they're working in, that brings in a lot of considerations around privacy and regulatory concerns, and I'm wondering if you can talk to some of the ways that you need to manipulate and manage that data to stay within those compliance requirements and just some of the technological aspects of being able to ensure appropriate access controls and auditing for people who are using these types of data sources?
[00:13:33] Unknown:
It's a great question. Privacy, compliance, and also security are areas where people don't just wanna stay within the lines, they wanna go above and beyond. These enterprises are already ingesting a lot of these types of external data, but they're doing it through lots of different systems and processes. And they're all trying to find ways to not just reduce the risks and meet their obligations, but go further and treat external data and internal data, which at the end of the day is all customer data, with the same processes, to be able to understand that they have the rights to the data that they have, that they use it in the appropriate way, and that it's all very secure. So privacy, compliance, and security are very different things.
On the compliance and privacy side, 1 of the basic questions is where does the data come from and with what consent is the access provided? And so we work with data partners and we conduct diligence on them. We understand data provenance. We do the same sorts of checks on providers that banks and insurers and others already do. We have a dedicated team of people that diligence the vendor, and we have a certification process that we believe is pretty unique and allows our clients to depend on our diligence and our contracts with the suppliers. So provenance is 1 thing. People need to understand the ways in which enterprises and the suppliers, and we, manage GDPR, CCPA, and other equivalent regulatory frameworks around the world. We operate in quite a few different countries. So when consumers request visibility into where their data came from or request that their information be suppressed, there's consumer protection regulation that defines processes for how that needs to be handled. So we help implement systems and processes and technology with our clients and for our own purposes that allow that to happen very efficiently.
Security is another interesting thing because if it's a bank matching their own customer information to third party data, then they need to make sure that their own customer information is protected and doesn't leak. For example, if a social security number and email address and phone number are being pinged against the API of a niche data provider, a startup that has a very, very interesting and relevant pocket of information, the bank has to ensure that the data isn't being stored and isn't being transferred cross border. There are various different attack vectors for how that information might end up leaking, and banks are quite rightly very, very conservative about that. So there has to be diligence on the vendor's systems in order to protect against any risk there, and there are also privacy enhancing techniques that can allow that matching to happen without exposing the bank's information, or the other way around as well, where the supplier's information has a risk of being exposed through the bank.
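One common family of privacy enhancing techniques alluded to here is to match on keyed hashes of normalized identifiers instead of raw values, so neither side exposes plaintext PII. The sketch below is a simplified illustration of that general idea under assumed field names; it is not Demyst's or any bank's actual implementation, and production systems typically rely on vetted protocols such as private set intersection.

```python
import hashlib
import hmac


def normalize_email(email: str) -> str:
    return email.strip().lower()


def blind_identifier(value: str, shared_key: bytes) -> str:
    """HMAC the normalized identifier with a key agreed for this one matching job."""
    return hmac.new(shared_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Both parties hash their records with the same per-engagement key and compare
# hashes, so raw email addresses never cross the wire in plaintext.
key = b"per-engagement-secret"  # illustrative only; real keys are exchanged securely
bank_side = {blind_identifier(normalize_email(e), key) for e in ["Jane.Doe@example.com"]}
vendor_side = {blind_identifier(normalize_email(e), key) for e in ["jane.doe@example.com", "bob@example.com"]}
print(len(bank_side & vendor_side))  # 1 overlapping identity, without exposing the values
```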
There's no 1 silver bullet answer to things like privacy, compliance, and security,
[00:16:43] Unknown:
and what we do is a combination of a lot of different things and keeping the focus on it. 1 of the benefits of Demyst, though, is that we do it as a centralized platform, which means we get to reuse that across all of our customers and not just 1. On the note of things like GDPR and CCPA, with the whole right to be forgotten and the requirement to be able to delete customers' data everywhere that it might live: because you're collecting this data from these various external sources and providing it to customers who then might incorporate that information into various analyses, I'm interested in understanding the workflow and life cycle of being able to notify those customers that this record needs to be expunged from wherever you happen to use it, and just helping them to manage the lineage tracking and provenance of those records, whether it's to delete the record from their data warehouse or to understand that they need to rebuild their machine learning model because it incorporated that user's record, and just some of the ancillary concerns that go along with these requirements and these regulations that are being more proactive about managing customers' information.
[00:17:54] Unknown:
Yeah. And they're not just ancillary concerns in many situations. They're core to the appropriate use of data, and there's an entire industry of technologies that are emerging to solve this problem, but it's not a stand alone problem. We, like others, look at lots of different approaches, such as requiring customers to redownload data periodically and maintaining very rigorous transactional logs and ledgers to ensure we know what data went where and when. We have processes with certain vendors to make sure personal information is only used, and only temporarily, during the model training process.
So that there's protection when those requests come in. It's not that straightforward for enterprises because these requests are happening today, and individuals are coming to data suppliers and requesting that their data be purged. And that data is already cascading downstream into lots of different client systems and being copied and pasted and recopied and repasted and reused, and it does create this lineage challenge that people haven't fully solved for yet, and they haven't integrated that data into their lineage workflows. It's also worth noting that there are different requirements if it's a marketing and advertising use case versus a use case such as fraud or AML or financial crime, credit risk, or insurance underwriting within a regulated institution, where the institution has the consent from the direct customer.
So regardless of where the 3rd party also got your consent, if you go to a bank and you ask them to give you a credit card or a mortgage or something like that, you're, in many ways, granting consent directly to the enterprise to use data, and that is the most direct form of consent. That simplifies the journey here and means that it's easier for enterprises to work with data that isn't passed through lots of different systems. Back to the engineering side of this, the main thing that really helps here is logs and lineage and tracking where the data went in the most granular form, in a protected, secure way, and using that when requests come in and effectively flushing the cache when
[00:20:17] Unknown:
those requests come in. So digging more into the platform that you're building at Demyst, I'm wondering if you can talk through some of the technical architecture that you've built out and some of the processes that you manage to be able to collect and clean up these various data sources, manage the auditing and access controls, and understand the sort of usage patterns so that you can maybe decide this dataset isn't really valuable anymore, we're going to retire it, or, you know, this is useful, we need to collect other datasets that are akin to this, and just some of that overall process of the technical and operational architecture of your system.
[00:20:53] Unknown:
We have the upstream connectors into sources and systems and solutions that we build against a library of templates of upstream connector types that are tailored. We have a layer of schematization of the data, so when we're interacting with the system, we don't just, for example, code it as a string; it's coded explicitly as an address, or a street, or a city, or an email address, or the specific type. So at the most granular level, as we integrate the source, we're defining the types and integrating against a preconfigured library, and we have automation technologies to set those things up and integrate them. Then once it's integrated, we run a technology that runs known, accurate sample data, where we can, against all systems.
So we have a standardized file, think of it, for example, in the business domain, as a brick of businesses where we know the truth about the business. We know that it's actually a pizza shop and has this many employees and this much income and so on. And we systematically run that file constantly at low volumes against all providers, and we pay the providers for that. And that allows us to create a very objective set of descriptive statistics about the connectors. So we aggregate the metadata. Is it accurate? Is the match rate high? Are the data elements stable? Are they changing over time? And are they orthogonal for the business processes that are being optimized?
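As a rough illustration of the kind of descriptive statistics such a truth file makes possible, the sketch below compares one provider's responses against known-good records; the field names, record structure, and metrics are hypothetical simplifications.

```python
from statistics import mean


def connector_health(truth_records: list[dict], responses: list[dict]) -> dict:
    """Compute simple match-rate, accuracy, and fill-rate stats for one connector."""
    matched = [r for r in responses if r.get("matched")]
    match_rate = len(matched) / len(responses) if responses else 0.0

    truth_by_id = {t["record_id"]: t for t in truth_records}
    # Accuracy: how often a returned field agrees with the known truth.
    agreements = [
        r["employee_count"] == truth_by_id[r["record_id"]]["employee_count"]
        for r in matched
        if r.get("employee_count") is not None
    ]
    accuracy = mean(agreements) if agreements else None
    fill_rate = mean(1.0 if r.get("employee_count") is not None else 0.0 for r in matched) if matched else 0.0

    return {"match_rate": match_rate, "accuracy": accuracy, "fill_rate": fill_rate}
```

Re-running a small truth file on a schedule and trending numbers like these is what distinguishes "the connector is up" from "the connector is still returning the data we think it is."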
That allows us to do things such as manage our own SLA obligations. Again, not just is the connector live or down, but has it changed? Is it returning the right data? Is it stable? So that's the upstream connector side and the monitoring of that upstream connector side. We have a layer on top of that, which we call recipes. Recipes are combinations of datasets around common business problems and the business logic around that. Demyst has an infrastructure that we've built that executes a DSL for the creation and execution of recipes. A recipe might be something like: take this attribute from this source and combine it with this other attribute, compare them to each other, run a waterfall against a third attribute from a third source, add some logic on top, divide it by 2, multiply it by 10, maybe even execute a predictive model in there. That recipe DSL is configured within our platform and executes at runtime, so that the shaping of the data can happen as part of the real time interaction with the customer.
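To make the recipe idea concrete, here is a toy interpreter for a declarative recipe of the sort described: waterfall across sources, compare attributes, apply simple arithmetic. The operation names and structure are invented for illustration and are not Demyst's actual DSL.

```python
def run_recipe(recipe: list[dict], attributes: dict) -> dict:
    """Evaluate a tiny, hypothetical recipe: each step reads named inputs and writes one output."""
    values = dict(attributes)
    for step in recipe:
        op, out = step["op"], step["output"]
        inputs = [values.get(name) for name in step.get("inputs", [])]
        if op == "waterfall":      # first non-null input wins
            values[out] = next((v for v in inputs if v is not None), None)
        elif op == "multiply":
            values[out] = inputs[0] * step["factor"] if inputs[0] is not None else None
        elif op == "compare":
            values[out] = inputs[0] == inputs[1]
    return values


recipe = [
    {"op": "waterfall", "inputs": ["vendor_a.income", "vendor_b.income"], "output": "monthly_income"},
    {"op": "multiply", "inputs": ["monthly_income"], "factor": 12, "output": "annual_income"},
    {"op": "compare", "inputs": ["vendor_a.phone", "vendor_b.phone"], "output": "phones_agree"},
]
print(run_recipe(recipe, {"vendor_a.income": None, "vendor_b.income": 4200,
                          "vendor_a.phone": "+15551230000", "vendor_b.phone": "+15551230000"}))
```

Because the recipe is data rather than deployed code, it can be configured in the platform and evaluated at request time, which is what lets the data shaping happen inside a real-time customer interaction.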
And then finally, I already mentioned on the downstream customer side, we have a set of infrastructure and processes around the integration into downstream systems, and we run that through a microservice layer we call data APIs where we set up lots of different instances of data APIs that each customer configures and deploys into their own production workflows. So that's the end to end journey from source to use. And then we have quite a lot of infrastructure around logging and monitoring different uses. And that can be everything from the more complicated situations, such as managing data compliance, down to the more basic situations, such as centralized billing.
Enterprises want to know, in aggregate, what they've spent across all data sources, and to debug and reconcile those bills against their own actual usage. So we have billing and reporting. We have error logging: where transactions fail, people need to know why they failed and modify their recipes accordingly.
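A minimal sketch of the centralized billing idea: aggregate per-transaction usage by source and flag sources where the metered spend disagrees with the supplier's invoice. The field names, prices, and billing rules are made-up simplifications.

```python
from collections import defaultdict


def reconcile(usage_log: list[dict], invoices: dict) -> dict:
    """Sum metered spend per source and flag discrepancies against invoiced amounts."""
    spend = defaultdict(float)
    for txn in usage_log:
        if txn["status"] == "success":  # assume only successful lookups are billable
            spend[txn["source"]] += txn["unit_price"]
    return {
        source: {"metered": round(spend[source], 2), "invoiced": invoices.get(source, 0.0)}
        for source in set(spend) | set(invoices)
        if abs(spend[source] - invoices.get(source, 0.0)) > 0.01
    }


usage = [{"source": "registry_x", "status": "success", "unit_price": 0.05},
         {"source": "registry_x", "status": "error", "unit_price": 0.05}]
print(reconcile(usage, {"registry_x": 0.10}))  # flags a 5-cent gap to investigate
```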
[00:24:32] Unknown:
Is that answering your question, Tobias, in terms of the stack? Yeah, and I'm mostly interested in whether there are any particular off the shelf tools that you're using and if there are any sort of custom solutions that you've had to build in house to be able to work with this specific problem domain and manage your internal systems in an efficient and scalable manner?
[00:24:50] Unknown:
So most everything I just listed, we've built in house because it didn't exist. But we're heavy users of cloud infrastructure. We build a lot of things on the AWS stack. So when it comes to managing the execution of the logic layer that is codified into these recipes, for example, we use things like AWS Lambdas. And when it comes to storing data, we'll use things like Athena and Redshift. And when it comes to managing logs, we'll use various logging platforms on AWS. When it comes to integrating downstream systems, we integrate with clients' GCP and Snowflake warehouses and other hosted SaaS platforms.
In terms of specific integrated commercial data providers, it depends on the situation, but we do work with AI systems such as DataRobot or SparkBeyond. Sometimes we work with privacy enhancing technology systems such as Infer that provide some great capabilities that we embed with our clients. But most of it, Tobias, is developed in house through our infrastructure and technology team. In addition to using the infrastructure, there's a lot of automation technology around the infrastructure itself, infrastructure as code, which is a bit of a buzzword, but what that means in our context is: because we're running across so many different availability zones and so many different customers, and because we're handling a very sensitive resource, which is protected information and clients' information, and because we have quite a lot of processes that govern how we handle that data, we have lots of different technologies to spin up and shut down capacity on our infrastructure, to manage releases,
[00:26:34] Unknown:
to set up single tenant environments with clients. So we have a lot of hand rolled technology around the infrastructure side too. As far as the datasets that you're working with, I'm wondering if you can give a sense of the sort of average or median size in terms of whether it's gigabytes or, you know, terabytes and how the relative scale of dataset and data volume that you're working with will inform the types of interfaces that you're making available to end users for being able to interact with that data and some of the, you know, data gravity concerns and considerations that play into how you provide that as a service to your end users? Some of the most valuable datasets are, you know, hundreds of rows, not billions of rows.
[00:27:16] Unknown:
A lot of the connectors are transactional, so individual records, very small payloads, but SLAs matter a lot, and edge case scenarios matter a lot. But the actual size of the data itself is kilobytes, 20 features about 1 person. There are bulk data files that we work with, that we ingest, that we manage diffs on top of: hundreds of millions and billions and tens of billions of rows, sometimes hundreds of billions and sometimes more. These are gigabytes and terabytes. I don't know whether it goes above that. I presume it does, but I'm not sure how massive the datasets get. More often than not, as I mentioned at the start, they tend to be more short and fat rather than tall and skinny. It's not clickstream or log data that is 5 attributes, but trillions of rows that we have to process in a millisecond.
You know, we're working with clients that are dealing with millions of customers, not hundreds of millions of customers and not thousands of customers, so millions of customers, but they're pulling together thousands, tens of thousands, hundreds of thousands of columns about those customers, and they're having to have effective and efficient ways to process that as part of a transaction that might be sub second, not sub 10 milliseconds. These are big, chunky, important decisions. They're background screening a person. They're diligencing a business. They're, you know, running a mortgage application. And generally speaking, if it takes, you know, 1 second, 5 seconds, 30 seconds, that's very high performance in the eyes of the marketplace.
[00:28:50] Unknown:
As far as the access patterns, it sounds like it's largely query based where the end user is requesting a given record or a set of records, and they're not necessarily looking for a push based API of inform me whenever there's an update to this particular record or this particular user's information.
[00:29:07] Unknown:
There are also trigger based workflows like that where they're saying, inform me when something changes. And there's an access pattern as well where the data itself is egressed into clients' warehouses, and they do their own matching and their own trigger based monitoring. And the desire is for us to be responsible for pushing in changes to datasets and full file access to datasets as they come into our infrastructure. But the pull based pattern tends to be preferred because it meets enterprises' requirements and allows them to get the freshest, most compliant element of data at the time of interaction with the customer.
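As a hedged illustration of the two access patterns being contrasted, the pull pattern fetches fresh attributes at the moment a decision is made, while the trigger pattern reacts to change events pushed from the provider. The endpoint, payload, and event shape below are hypothetical placeholders, not Demyst's actual API.

```python
import json
import urllib.request


def enrich_at_decision_time(customer: dict, api_url: str, api_key: str) -> dict:
    """Pull pattern: request the freshest external attributes right when the decision happens."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps({"inputs": customer}).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def on_change_event(event: dict) -> None:
    """Trigger pattern: a monitoring feed says something changed; update the local copy."""
    print(f"record {event['record_id']} changed fields {event['changed_fields']}")
```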
[00:29:46] Unknown:
And as far as the work that you've been doing to build out Demyst and get it to the point that you're at now, I'm wondering what have been some of the most complex or difficult engineering challenges that you've dealt with.
[00:29:57] Unknown:
The most complex challenges have been around the data itself. The ecosystem as a whole is diverse and fragmented and exciting and messy; if it wasn't, we wouldn't be in business. But the actual data within each data partner is messy too. There are modeled and actual attributes. There are data elements that have lots of different unique definitions. It's not always obvious what the levels of the data represent, and it's not obvious how to do entity resolution across vendors and within vendors. So getting from a very rich but messy world of fragments and pockets of information into a single customer view that actually makes sense is, and in my view will be for a long time, 1 of the biggest and most interesting engineering challenges. People have been talking about that for as long as I've been working, and we'll keep talking about it. So there's a lot of engineering challenges around data cleansing and data linking and data resolution.
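A simplified illustration of the entity resolution problem just described: normalize business names and score whether two vendor records refer to the same entity. Real pipelines use far richer blocking, weighting, and review; the fields and weights here are arbitrary assumptions.

```python
import re
from difflib import SequenceMatcher

LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "pty", "co", "corp"}


def normalize_name(name: str) -> str:
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match_score(record_a: dict, record_b: dict) -> float:
    """Blend name similarity with postcode agreement; the 0.7/0.3 weights are arbitrary."""
    name_sim = SequenceMatcher(None, normalize_name(record_a["name"]),
                               normalize_name(record_b["name"])).ratio()
    postcode_match = 1.0 if record_a.get("postcode") == record_b.get("postcode") else 0.0
    return 0.7 * name_sim + 0.3 * postcode_match


a = {"name": "Demyst. Data Limited", "postcode": "10013"}
b = {"name": "DEMYST DATA", "postcode": "10013"}
print(match_score(a, b))  # close to 1.0, so the two vendor records likely describe one entity
```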
Some other interesting engineering challenges have been around, and I mentioned it already on the infrastructure side, configuration management and orchestration of the different interconnected systems that have to work in order to provide what is an enterprise grade system and set of promises to a bank, to an insurer, to another large enterprise, on the back of these upstream data vendors that sometimes just don't operate in that way and weren't necessarily intended to operate in that way when they originally built their own infrastructure. So for example, upstream sources in emerging markets that still don't have APIs, or if they do have APIs, they have significant downtime, or they provide batch data bricks, but they don't have matching technology to go with it.
So the provision of a set of enterprise capabilities on top of that fragmented ecosystem provides a significant engineering challenge.
[00:31:48] Unknown:
On the topic of engineering challenges, 1 of the perennial problems of dealing with end users and being able to manage data access is the ways in which they want to access it. So, you know, everybody has dealt with the issue of having to, you know, wonder whether or not the FTP file has been uploaded to the right location so that I can run my downstream job. And, you know, I noticed when I was looking through your documentation on your site that you provide a number of different APIs and file transfer methods, and I'm wondering if you can talk to some of the strategies that you've built to be able to manage these different interfaces for customers to be able to access the data in the way that is most convenient for them, while still being able to maintain your own sanity, providing access to all of these different datasets in various ways while still keeping the underlying data well managed and not having to deal with a lot of duplication?
[00:32:42] Unknown:
Yeah. There's still a lot of that sort of stuff that has to happen in the marketplace, and certainly Demyst is no different. You've gotta run checksums and various other things to make sure that you know you downloaded the right number of records versus what was expected, and what happens if there's a missed drop, and what are the downstream workflows that are affected? As I mentioned, we integrate with a lot of downstream workflows at enterprises. Maintaining our own sanity is important, but it's worth noting that enterprises have to manage this stuff themselves anyway. They're already accessing a lot of these datasets directly from source. It's just that they have duplication of their own systems across each enterprise as they integrate with their systems. And, you know, we do that too, but we do it once and we share the benefit of that across our customer base. So we have teams of people that get notified when things don't land from our sources, and we pick up the phone, and we call the vendor, and we talk to them, and there's not necessarily a magical technology solution to that. Sometimes something went wrong and we need to manually rewire things or we need to rerun the transaction, and that's okay because we only have our team doing that once as opposed to every customer doing that all the way across the ecosystem.
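A minimal sketch of the sort of file-drop validation being described: compare the delivered file's checksum and record count against what the vendor said to expect, and surface anything that disagrees. The manifest values are assumed inputs for illustration.

```python
import csv
import hashlib


def validate_drop(file_path: str, expected_sha256: str, expected_rows: int) -> list[str]:
    """Return a list of problems with a delivered bulk file (an empty list means it looks fine)."""
    problems = []

    sha = hashlib.sha256()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    if sha.hexdigest() != expected_sha256:
        problems.append("checksum mismatch: file may be truncated or corrupted")

    with open(file_path, newline="") as fh:
        rows = sum(1 for _ in csv.reader(fh)) - 1  # subtract the header row
    if rows != expected_rows:
        problems.append(f"expected {expected_rows} records, found {rows}")

    return problems  # in practice a non-empty list would page a team or open a ticket
```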
We have alerting and notifications and workflows that notify our teams and our client teams when things are not within expectations. I already mentioned that in the context of our monitoring around statistical metadata as well: not just whether things are working or not working, but whether the data is very different to what was expected, because people might have to retrain their downstream systems if things have changed materially. These systems and processes have to exist to make sure the data is flowing the way it should flow. They just shouldn't exist redundantly in every single enterprise and every single data provider. There's more and more platforms out there, not just Demyst, but, you know, Snowflake has a data exchange, and Amazon has 1, and there's lots of different great capabilities out there where people are centralizing how this stuff is managed, separate from the content itself.
[00:34:48] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box. As you mentioned, there are a number of different data exchanges that have been growing up, and I know that you've been at this for a while. And I'm curious what your thoughts are on the sort of vendor specific solutions, such as what Snowflake is doing to provide data as a shared asset or what BigQuery has been doing, versus what you're building, and some of the different priorities and use cases that would lead somebody to use something like Demyst, which is a very full featured platform, versus just using a dataset that somebody provides through the Snowflake data sharing capabilities?
[00:36:16] Unknown:
Yeah. And we partner with Snowflake and Amazon Data Exchange and others. Those are some great emerging capabilities that we use for our own data ingestion as well in some situations. And they solve a piece of the technology problem. They don't solve the compliance and licensing problem. You still have to enter a contract with the third party data source. And the data may or may not be there, and where it is, it's a simple mechanism to get data. Our clients work with those platforms too if what they want is just a well defined, curated, available bulk brick of data.
So it's a CSV in an S3 bucket, or it's a Snowflake table, or through Google, you can do dataset shares without redundant copy and paste. In Snowflake, you can do the same thing, and Redshift now as well. If it's a curated dataset that is just moving from point a to point b, that's where you don't necessarily need a platform like Demyst. If you already know what data brick you want, you don't need the help in discovery. You don't need the help in deployment into real time systems. As for the more full featured service providers like us, people would use us where they really need that last mile delivery and some of that recipe capability. You know, we pride ourselves on going end to end with the customer and actually getting things out of the lab and into production, into workflows with the right combination of data, typically under a single contract and with ongoing monitoring of that real time deployed use case, versus having to do all of that discovery, integration, and licensing directly on top of those exchanges that are out there.
[00:37:48] Unknown:
For working with these various external data sources, I'm wondering if you can just talk through the overall workflow of identifying what a new useful source might be and then going through onboarding it, integrating it into your systems, setting it up in the billing for customers to be able to find it and pay for it and just the end to end workflow from, you know, you deciding this is a useful data source or somebody requesting a particular data source to it being available in your marketplace?
[00:38:17] Unknown:
So data sources approach us, and also our clients have great ideas, great road maps. They engage data vendors all the time. They don't yet know whether there's a there there with a data vendor, you know, whether it's really predictive and useful and valuable or whether it's not. But 1 of the great conundrums in the marketplace is that in order to know whether there's a there there, they have to bear all of the cost and pain of the data engineering and the contracting and the diligence. And often, it will take upwards of 6 months to go through that pain, and that's before they know if they really wanna buy the dataset product. So what happens is people just tend to stay with the large incumbents that they already work with. It's too painful a marketplace. But in our situation, the clients will refer the vendor to us and say, oh, we work with Demyst. You know, why don't you go and get onboarded into their platform? And then we'll test it, and we'll try it. And if we like it, we'll use it, and the vendor's happy and the customer's happy. And we bear that cost and pain. Once we get past the discovery phase, we have the questionnaires that we've built into our own certification process.
We then diligence the company and review what they share with us. And then we run data tests to make sure the data is accurate against these truth files, then we integrate the connectors, then we make it available in our catalogs. Sometimes we make it available publicly. Sometimes it's private. Sometimes a client might have a proprietary relationship with 1 of their partners, which involves data sharing, first party data sharing. And so we'll go through the same sort of processes with the supplier, with the data source, but we'll publish it only to that client's own organizational configuration.
[00:39:58] Unknown:
As you have been working on Demyst and dealing with all these different data sources and all of the, you know, data cleanliness issues that I'm sure crop up, and being able to manipulate them and store them in a relatively uniform fashion and provide all of the supporting infrastructure to make those datasets usable, I'm wondering what are some of the most interesting or far reaching lessons that you've learned about data engineering as a discipline and the state of the industry as a whole?
[00:40:27] Unknown:
I wouldn't claim to be as much of an expert in data engineering as you and, I'm sure, many of your listeners. It really is the basics, things that people have known about for a long time but that haven't yet been applied in the world of external data. I can't tell you how many times there's been value created by something simple like just talking to the vendor, reading their documentation, and actually implementing their documentation the way it was intended, like putting a plus in front of the phone number for an international phone number or, you know, putting the right country code in as opposed to forgetting the country code. I mean, these aren't necessarily data engineering challenges or best practices. It's just being careful and rigorous and integrating things in the right way. Because what will often happen is, you know, a bank will work with a vendor. They'll throw the file over the fence, the vendor will throw it into their system, they'll throw the file back, but often people won't even actually look at the data, and then it lands back at the bank. They're like, the data's crap. No. The data's not crap. It's just that nobody had the bandwidth or focus to optimize the matching and the logic and do the basic stuff right.
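A small sketch of the kind of "boring basics" being described, normalizing phone numbers to an international format before sending them upstream; the dialing-code table and default-country assumption are illustrative only, and a production system would use a maintained library.

```python
import re

DIAL_CODES = {"US": "1", "AU": "61", "GB": "44"}  # illustrative subset


def to_e164(raw: str, default_country: str = "US") -> str:
    """Normalize a phone number to +<country code><number>, as many vendor APIs expect."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits          # country code already explicit
    digits = digits.lstrip("0")      # drop a trunk prefix such as the leading 0 in AU numbers
    return "+" + DIAL_CODES[default_country] + digits


print(to_e164("(555) 123-4567"))      # +15551234567
print(to_e164("0412 345 678", "AU"))  # +61412345678
```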
Other things: you get misspelled, fat fingered business names, for example. Like, you know, instead of Demyst Data Limited, it will be, you know, Demyst space Data with a dot in it or something like that. We'll do canonicalization of the input data, sometimes using third party data providers, sometimes using our own standardization engines that we've built into our platform. But getting the basics right and having clean input data means that the match rate is high quality. It's not magical data engineering. It's more just getting the inputs right and then conducting tests across the outputs and standardizing how that's done. And I think it's been surprising which business problems we've solved over the years as well.
Enterprises always wanna talk to us about the bright, shiny, innovative things they're trying to do with aerial imagery or footfall traffic or sentiment. Like, there's lots of great things out there, and they help. But the things that really move the needle for a lot of enterprises are, again, the basics. It's how do I know this is a real person? How do I know they have a job? How do I know this company exists? How do I do it at scale in a predictable way across lots of sources and lots of countries and lots of products? And what's most interesting for me is the boring stuff that keeps coming up with customers time and time again. It isn't new in the zoo in terms of business problems, but the maturation of the ecosystem and technology means that they can now solve it in a much, much bigger and better way for everyone's benefit.
[00:42:55] Unknown:
As far as the ways that your customers are applying the data that you're providing as a service, what are some of the most interesting or innovative or unexpected ways that you've seen it applied or some of the most interesting insights that you've seen people gather from the datasets that you provide?
[00:43:11] Unknown:
So some of that we can't share, of course, because it's proprietary to the customer and confidential. As for the insights that people come up with, people are usually looking to solve something that they already know and that they're already doing in some way when they get started. A banker is already collecting data in printed financial statements from a business, or a lender is already getting, you know, photocopies of passports and driver's licenses, or an insurer is already trying to find out what vehicle make, model, and year you have and how far your house is from the fire station. These things are all now accessible and pre fillable from the data ecosystem without friction from the customer.
It's, you know, not very insightful, Tobias. People already know that if you're close to a fire station or a fire hydrant and you own a house, that house is therefore less likely to burn down. It's not a new insight that that is the case. It's just a new method of getting the information. People already know that if a business has strong cash flow, it's gonna be a better credit risk. It's just that instead of getting it from audited P&Ls through PDFs and faxes and, you know, people talking to bankers, you can now get it through a single click and access to accounting systems. It's not a new insight. It's just a better way to get to the insight. I mean, our clients do do various really interesting, clever things. When they step back and they scan the ecosystem, they find correlations between different patterns, different spending patterns, or different weather patterns, and things that they haven't thought of before, whether it's an auto insurer figuring out that if you drive east and west to commute versus north and south, you're more likely to crash because you're driving into the sun at sunset and sunrise, which, you know, affects your insurance propensity, or on the fraud side, interesting insights that come out from how frequently data has changed and comparisons across datasets, which identify people that are creating fake profiles. There's some insights there. I mean, there's always those insights in credit and lending where they find uses for unstructured data, such as text in application fields, and look at the grammar and the way in which people write things, which in some countries can be used as part of an underwriting process. So lots of interesting things are happening, but I get, as you can tell, more interested by the boring stuff. In your own experience
[00:45:39] Unknown:
of building and growing the Demyst company and the platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:45:47] Unknown:
It's client centricity and tenacity, and those aren't just buzzwords you put on a poster in the office. Building any kind of enterprise technology business is hard and has lots of twists and turns, and building a team around that is hard. And at the end of the day, the true north around that is to listen to customers and be open with them and work hard; things will go up, things will go down. But if you put a technical team that doesn't have an ego with a customer that actually has a real problem, and then you get out of the way, then lots of great things can happen and smart people will build smart capabilities around delivering value. I've also learned over the years that it's really hard to stay focused, especially, you know, when you haven't raised the sorts of, you know, galactically enormous amounts of money that you read about in TechCrunch, where you have customers solving real problems, but they're not always directly down the line of what you thought your product strategy was going to be. But they're close enough that you build related technology and you end up proliferating into a lot of different use cases. And, you know, every 6, 12, 18 months, you've gotta pop your head above the parapet and refocus and refocus and refocus. And I've learned that that's a hard thing to do. And again, you know, if you have great people and good customers and they're solving real problems and you put them all in the room together and talk about those issues openly with them, that trumps all of the preplanning that you thought you knew in advance of building things.
You know, better ideas come from the marketplace than from a whiteboard. That's the main thing that I've learned over the years through engaging with the marketplace. And we're pretty excited about where we are and what's in store for the future. The market's definitely matured and enterprises are far more effective at capturing value from data. They're solving some of the biggest problems they've ever solved, especially in a post COVID world. Everything's digital, the ability to customize journeys with customers is much, much better than it used to be, the enterprise tech stack's more mature, and we're on the ground floor of people harnessing value from data, whether it's internal or external.
The future is bright around solving these problems across the entire stack.
[00:48:03] Unknown:
And for people who are interested in being able to incorporate external data sources into the analyses and machine learning workflows that they're building, what are some of the cases where Demyst is the wrong choice and they might be better served by either using something like the Snowflake data sharing or building out their own internal capacity for being able to collect and clean up these datasets?
[00:48:25] Unknown:
As I mentioned earlier, if it's a really plain vanilla brick of data that is well curated and well understood, then sometimes it's easier to just go directly to sources like that, especially if it's a very large dataset where enterprises don't wanna deal with the storage cost associated with that. The mechanisms offered by those cloud platforms can be a pretty efficient way to do things. There's a very important build versus buy question around data management. Some enterprises do see the way in which they work with external data as a strategic advantage. There might be a solution provider that's building fraud scores or there might be a bank that is accessing pockets of data that are so sensitive that they have a unique advantage.
In those situations, it might still be the right choice to work with us, but people just don't want to, because it's not something that they wanna outsource in any way, shape, or form. Those are the 2 areas. Oh, sorry, there's a 3rd area as well. Sometimes it genuinely is just a single source problem. People know what the source is. There is only 1 source. Integrating with it is not that painful. We're a retailer. You know, we think of ourselves sometimes as a supermarket. You come to the supermarket because you wanna buy lots of different fruits and vegetables, and you wanna make sure they're diligenced back to source and they're safe to eat, and you might wanna change what you buy every week. But if all you bought every week was bananas,
[00:49:48] Unknown:
and you bought a lot of them, then go to the farm. Don't go to the supermarket. It's cheaper. That's when it's better not to use us. It's a good metaphor. And I think 1 of the interesting challenges there is that it might start with 1 data source, and you say, okay, I'm just going to build it myself. And then down the road, you add a second data source. At some stage, you get to a tipping point where you actually are better served by buying something from a platform like Demyst, but you've already put in so much effort that you suffer from the sunk cost fallacy of, well, I've already built out all these other systems, and, you know, they're becoming a little unmanageable, but what's 1 more data source? And so there's always that challenge of, at what point do you decide that you've gone too far and you need to just throw it all away and go with the prepackaged option? There's an analogy to product development and engineering, which is it's
[00:50:37] Unknown:
always hard to get people to focus on refactoring anything. It's always the thing that you do later, but then when you do do it, you always breathe a sigh of relief. And it always ends up taking a lot less time than people thought it would, and creates a lot more benefit than people thought it would, but it's always hard to prioritize doing that housecleaning. So, yeah, that does happen. And you've pretty accurately summarized what some of our sales cycles look like. People take 6, 12, 18, 24 months sometimes of doing things 1 at a time directly with the marketplace.
And then, you know, some of them eventually save a lot, and some of them just keep chipping away, incrementally adding 1 more if-then statement, 1 more piece of gaffer tape, 1 more thing into their system. Now, regulation and compliance is a really interesting question that comes up here. If you keep adding just 1 more thing and 1 more element and 1 more stream and 1 more contract and 1 more systems integration directly to source, and you don't centralize it, at some point the chief data officer comes along, or even the regulator knocks on the door and talks to the chief data officer and says, where did you get all of this data? You know, Facebook is telling their customers where they got the data and giving them the right to manage it. So is Google. So is everyone else. You're a bank. Can you tell us where you got all of this third-party data and how you manage it, in 1 place with 1 set of reports?
And that will create the impetus to manage this stuff centrally.
[00:52:03] Unknown:
As you continue to build out the Demyst platform and work with your customers, what are some of the things that you have planned for the near to medium term, either in terms of new datasets or new interfaces or new platform capabilities or general projects that you're excited to get started with? We're very excited about some of the new technologies around
[00:52:22] Unknown:
consent management and privacy-enabling technologies that double down on what we already do, and what that means for further unlocking data that's in the ecosystem. We're also very excited about the last mile and self-serve capabilities. We're usually a fairly white-glove partner for customers for the first 1 or 2 use cases, but as they start to get good at working with data sources, customers do amazing things: they build great recipes that we'd never thought of and deploy them into systems that we hadn't thought of. We've recently been launching a lot of capabilities around that.
On the last mile side, we're building more verticalization into our solutions so that we can focus on key business problems that we know work. You know, welcome to my supermarket, here's the cereal aisle, but we know these 3 boxes work in this particular problem domain, so we can save you time. If you want different cereal, no problem, put it in the basket and we'll help you out. But we know these 3 work relative to each other, and we know they work in the context of this business problem. So we're excited about that, and we're excited about the emergence of enabling cloud capabilities like Databricks-style Delta Sharing and what that means for data ingestion, table sharing through platforms like Snowflake and Redshift and Google, and what that means for the ability to do 2-way and 1-way data enrichment and sharing and to build workflows around that. We're excited about a lot of the development in the technology ecosystem and, ultimately, about the broad vision we have at Demyst, which is that the ecosystem of external data is a vast, untapped resource. For all of the data that's out there, people use relatively little of it in production.
And it's not because the data doesn't exist; it's because there's friction to deploy it and friction to get it. Our average customer typically uses more than 10 times what our average new customer does. Even if we don't do anything that clever with the data, that speaks to how the market isn't as efficient as it could be. And when you take friction out of accessing something and doing something, whether it's data or AI or BI or anything else, people don't just do it more cheaply. They do it more often. So we're excited about unlocking access so that people can use more data and capture more value.
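As an aside for readers, here is a rough sketch of what consuming a shared table over the open Delta Sharing protocol mentioned above can look like with the open source delta-sharing Python connector. The profile file, share, schema, and table names are hypothetical placeholders rather than actual Demyst shares.

```python
# Illustrative sketch of reading an externally shared table over the open
# Delta Sharing protocol. The profile file and table coordinates below are
# hypothetical placeholders, not real Demyst shares.
import delta_sharing

# A ".share" profile file holds the provider's endpoint URL and bearer token.
profile = "external_provider.share"

# Discover which tables the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table directly into a pandas DataFrame for enrichment work.
table_url = f"{profile}#demo_share.reference.business_firmographics"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```

The resulting DataFrame could then feed an enrichment or scoring workflow without the consumer standing up their own ingestion pipeline for that provider.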
[00:54:43] Unknown:
Are there any other aspects of the work that you're doing at Demyst, or of the overall space of collecting, managing, and incorporating these external datasets for analytical purposes, that we didn't discuss yet that you'd like to cover before we close out the show? I just encourage folks, if they
[00:54:59] Unknown:
are in the ecosystem, if they're managing a direct integration with a single data provider and they're looking to consider different alternatives, take a look at demyst.com, sign up, play around, and provide some feedback. We have a consumption-based pricing model, so people can sign in, try it, and pay as they go if they want to. And we see it as a very, very big ecosystem and a very big marketplace that is unlocking a lot of value. So I just encourage folks to reach out and engage with us, whether they're a supplier, a consumer, or an individual who's just interested in the space and in talking about some of the things we covered on today's show.
[00:55:38] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in tooling or technology that's available for data management today. There's so many great solutions out there
[00:55:56] Unknown:
for different pieces of the data engineering stack, and we get pigeonholed into our little silo of external data, but I know the question you're asking is much broader than that. I think the picks and shovels underneath AI and BI and decision systems, for getting clean data, are still nowhere near as simple as they could be. When it comes to data prep and cleansing, I believe in my bones that at some point someone somewhere is going to come up with this magical system where you put messy data in and clean data comes out. I haven't seen that yet. I haven't seen the system that does the basics we all learned about in Computer Science 101 or Statistics 101: let's remove the outliers, let's take this numerical value and quantize it, let's apply a log or an exponent to this column, let's change m and f to male and female, let's get rid of the underscores and make it capitalized.
Let's delete this erroneous column that's always filled with nulls and is a total waste of time. All of those things that every data engineer on the planet still spends, in my view, a lot, if not the majority, of their time on. Systems should just do that for you, and we should all be spending our time thinking about the more interesting problems, like linking and matching and entity resolution, and everything from deployment through to maintenance. But that to me is a big gap. And look, if anybody on the show
[00:57:30] Unknown:
knows of that tool and wants to point me to it, I'd be a very happy customer. But I'm still waiting for the day when data prep has an easy button. Yeah. There are definitely plenty of vendors that would like to tell you that it exists, but eventually, you get to the point where the easy button stops working and you still have to do all those same things.
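As an aside for readers, the routine cleanup steps Mark lists above map onto just a few lines of pandas. The sketch below is illustrative only, assuming a hypothetical DataFrame with gender, segment, and income columns; it is not the magical do-everything system he is describing.

```python
# A minimal pandas sketch of the routine cleanup steps described above.
# The column names (gender, segment, income) are hypothetical.
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Drop columns that are entirely null and carry no signal.
    df = df.dropna(axis="columns", how="all")

    # Recode m/f into readable labels.
    df["gender"] = df["gender"].str.lower().map({"m": "Male", "f": "Female"})

    # Strip underscores and capitalize a messy categorical column.
    df["segment"] = df["segment"].str.replace("_", " ").str.title()

    # Clip numeric outliers to the 1st and 99th percentiles.
    low, high = df["income"].quantile([0.01, 0.99])
    df["income"] = df["income"].clip(low, high)

    # Apply a log transform, then quantize into 5 equal-frequency bands.
    df["log_income"] = np.log1p(df["income"])
    df["income_band"] = pd.qcut(df["log_income"], q=5, labels=False)

    return df
```

Calling `basic_clean(raw_df)` on such a frame returns a tidied copy with outliers clipped, codes expanded, and a quantized income band added.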
[00:57:47] Unknown:
Yeah. And then you throw out the tool, and you end up just jumping back into Python and writing it yourself. And maybe the answer isn't a vendor solution. Maybe it's an open source solution, and people are developing some great libraries in Python to do various subsets of this. But the world is still quite
[00:58:02] Unknown:
early in its development of these capabilities, and I'm sure lots of great things will come out that solve that problem over time. Well, thank you very much for taking the time today to join me and share the work that you're doing at Demyst Data. It's definitely a very interesting problem domain and an interesting set of capabilities that you've built out. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It's great to connect. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Mark Hookey: Introduction and Background
Overview of Demyst and Its Services
Industries and Use Cases for External Data
Types of External Data Sources
Privacy, Compliance, and Security in External Data
Technical Architecture of Demyst
Data Volume and Access Patterns
Engineering Challenges in External Data Management
Onboarding and Integrating New Data Sources
Customer Applications and Insights
Lessons Learned in Building Demyst
When to Use Demyst vs. Other Solutions
Future Plans and Exciting Developments
Closing Remarks and Contact Information