Summary
The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Mark Hookey about Demyst Data, a platform for operationalizing external data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Demyst is and the story behind it?
- What are the services and systems that you provide for organizations to incorporate external sources in their data workflows?
- Who are your target customers?
- What are some examples of data sets that an organization might want to use in their analytics?
- How are these different from SaaS data that an organization might integrate with tools such as Stitch and Fivetran?
- What are some of the challenges that are introduced by working with these external data sets?
- If an organization isn’t using Demyst what are some of the technical and organizational systems that they will need to build and manage?
- Can you describe how the Demyst platform is architected?
- What have been the most complex or difficult engineering challenges that you have dealt with while building Demyst?
- Given the wide variance in the systems that your customers are running, what are some strategies that you have used to provide flexible APIs for accessing the underlying information?
- What is the process for you to identify and onboard a new data source in your platform?
- What are some of the additional analytical systems that you have to run to manage your business (e.g. usage metering and analytics, etc.)?
- What are the most interesting, innovative, or unexpected ways that you have seen Demyst used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Demyst?
- When is Demyst the wrong choice?
- What do you have planned for the future of Demyst?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm interviewing Mark Hookey about Demyst, a platform for operationalizing external data. So, Mark, can you start by introducing yourself? Hi, Tobias. Great to meet you.
[00:01:42] Unknown:
I'm Mark Hookey. I'm the founder and CEO of Demyst Data. And my background is at the intersection of data and analytics. I've been in this space for 20-some years. I'm very interested in the world of operationalizing external data into client workflows.
[00:02:00] Unknown:
And do you remember how you first got involved in data management?
[00:02:03] Unknown:
Earlier in my career I was more focused on analytics, and then we had a business that was purchased by a bureau called ChoicePoint that became part of LexisNexis, and it had a team of really incredible data scientists who had that typical challenge of just not even being able to get their hands on data. So I became very interested in the challenge of why it is, when we're supposedly awash with data, that people who can do great things can't tap into it. I started to look at the problem underneath the analytics and have spent a lot of time, in the context of Demyst, researching and building out technologies around that.
[00:02:49] Unknown:
And so in terms of what you're building at Demyst, can you give a bit of an overview about what it is that you're building there and some of the story behind it? So Demyst is building an external data ops platform.
[00:03:01] Unknown:
And what we help enterprises do is discover, curate, contract with, and operationalize external data. An example of that is a bank that needs to verify consumers or run financial crime checks; it typically has to work with an external ecosystem of commercial data providers and open data providers to find out whether people are who they say they are, whether they have a job, whether they're a real person. And the world of external data vendors is very fragmented, and they all have idiosyncratic interfaces and different contracting approaches. And big enterprises, banks, insurers, and others have very, very high friction in integrating with and deploying those data sources. So we're building Demyst as a 1 stop shop to do that, under 1 API and 1 contract.
We've been in business for 11 years, and we help some of the world's leading banks, insurers, fintechs, insurtechs, and we tap into what we believe is the richest ecosystem of information to help solve relevant use cases.
[00:04:17] Unknown:
As far as the services and systems that you're building out to be able to provide these external data sources to your customers, I'm wondering what are some of the capabilities that are necessary for being able to collect these various data sources, get them cleaned up, and presentable for the various organizations to be able to consume and incorporate into their own analytics workflows?
[00:04:42] Unknown:
Well, there's upstream, the sources themselves, and there's downstream, which is where we egress our data. Upstream, we've built our own technology platform that integrates with data sources' own APIs. Everybody's got a slightly different API. They've got different schemas. They've got different types. Some of them have batch data access, streaming, synchronous APIs, asynchronous APIs, consent based access, the whole gamut. So we have thousands and thousands of connectors that we've built 1 at a time into our platform. Downstream, we allow our clients to access the data in the systems that they already use. There's CRMs, Salesforce, Dynamics, where people are looking to tap into the value of data in those systems. There's API gateway technology, like MuleSoft and Apigee and other B2B data gateways.
There's data warehousing technologies, Snowflake, Redshift on AWS, data lakes in S3, and there's decision engines: in banking, it's systems like Experian's PowerCurve technology, or in insurance it might be a policy management system. So there's a variety of systems downstream as well, and we have adapters to allow people to pull the data. So we provide the harmonization and standard schema and then provide adapters into downstream systems.
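To make the harmonization idea concrete, here is a minimal Python sketch of two upstream connectors mapping idiosyncratic vendor payloads into one explicitly typed standard record. The field names, vendor formats, and StandardBusinessRecord type are illustrative assumptions, not Demyst's actual schema or connector code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandardBusinessRecord:
    """Hypothetical harmonized schema: every field has an explicit, declared type."""
    business_name: str
    street: Optional[str]
    city: Optional[str]
    country_code: Optional[str]
    employee_count: Optional[int]


def from_vendor_a(payload: dict) -> StandardBusinessRecord:
    # Vendor A returns a flat JSON object with its own field names.
    return StandardBusinessRecord(
        business_name=payload["biz_nm"].strip(),
        street=payload.get("addr1"),
        city=payload.get("addr_city"),
        country_code=payload.get("ctry"),
        employee_count=int(payload["emp_cnt"]) if payload.get("emp_cnt") else None,
    )


def from_vendor_b(payload: dict) -> StandardBusinessRecord:
    # Vendor B nests the address and reports employees as a string range like "10-49".
    addr = payload.get("address", {})
    emp = payload.get("employees")
    return StandardBusinessRecord(
        business_name=payload["name"].strip(),
        street=addr.get("line_1"),
        city=addr.get("town"),
        country_code=addr.get("iso_country"),
        employee_count=int(emp.split("-")[0]) if emp else None,
    )
```

Each upstream connector owns the mapping from its vendor's quirks into the shared record, so every downstream adapter (warehouse table, API response, CRM field mapping) only ever has to understand one schema.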
[00:06:07] Unknown:
In terms of the customers that you're working with, I'm wondering if you can give a sense of the sort of types of industries that they're working in or the types of verticals or problems that they're trying to solve when they're incorporating these different external data sources and when it's not sufficient to use the data sources that are already internal to an organization?
[00:06:26] Unknown:
So external data is really, by definition, when an enterprise is solving a problem where the outside world knows more about the customer than the enterprise does. So things like customer acquisition and onboarding, where you're new to a bank, new to an insurer, new to a telco, where you don't have a rich, long history with that enterprise. They're the sorts of problems we tend to focus on. Now almost all large enterprises are using external data already. They're working with 1 or 2 credit bureaus. They're working with various other external data providers already. Some problems, though, by definition, also require a myriad of external sources. They require things like waterfalls and comparisons across external data sources.
Some problems, for example, are more dynamic, and they require the constant evolution of more and more and more data sources. So we tend to get involved in those sorts of problems as well. We're operating in a lot of different verticals and operating with very, very large global banks and insurers as well as scaling startups that are Fintechs and InsurTechs, working with travel and tourism. We're working with professional services firms and solution providers. So there's a whole range of different applications, but broadly, the domain of target customer is somebody who needs to operationalize and use external data to understand their customer, and that's broadly customers that are new to that institution or where they're trying to harness external information to service that customer that they don't have in their 4 walls. And what's really interesting, 1 of the things that's helping us scale is that the internal data engineering stack has matured a lot in the last few years, and that's allowed us to work with customers that have got their internal data into the right place and into the right gateways and technologies and catalogs and data lakes. And we've effectively bolted onto that and said to customers, we'll work within that stack. Now that you've got your internal data house in order, let's help to organize external data into the same infrastructure. Because as far as internal consumers and internal data engineers and data scientists at an enterprise are concerned, they don't necessarily care whether it's internal or external data. They just want the stack to work with both. But the external data has to be conformed into that internal state.
[00:08:46] Unknown:
And as far as these external data sources, I'm wondering what are some examples of the types of datasets that you're working with and providing to these different organizations and some of the potential origins of those datasets, whether it's something that you have to collect on your own and aggregate or if you are taking maybe, you know, publicly available government data but coalescing it into a form that's easier to work with, and just some of the ways that these types of data are useful despite the fact that they originate from outside an organization's boundaries?
[00:09:17] Unknown:
So there are some sources. If folks wanna take a look at demyst.com, people can sign up and have a look at the sources. Some of them are government, as you mentioned, things like business registries. You know, when was the small business or large business registered, and who are the directors of that business? There's also commercial data providers, like the bureaus themselves, Equifax, Experian, and many, many other smaller commercial data providers that have different pockets of information that have been contributed or aggregated over time that includes customer information, business information, property and address information.
So for example, in the insurance space, we partner with a lot of data aggregators that pull together things like court records and building permits to understand whether homes have been renovated and what types of roofs they have, and do they have swimming pools in the backyard and motor homes and security systems and various things that property insurers need to understand. These are often data sources that enterprises already ask customers about through manual workflows, but by going through these routes, processes can be automated. Another domain of external data sources is consent based sensitive information, such as people's credit reports or businesses' transactional information that comes from accounting systems such as Xero or Intuit, where the consumer or the business is granting consent to our clients to access that information, and we, on behalf of the client, safely and securely pull that information through.
There's quite a lot more than that. It's a very, very broad ecosystem out there, but information broadly falls into those buckets: raw source of truth information such as government data, aggregated information that is curated by commercial data providers, and consent based customer information.
[00:11:10] Unknown:
As far as the data sources that many sort of newer organizations are dealing with, the kind typified by the so called modern data stack, coming from a lot of these different SaaS tools and being integrated into their data warehouses with tools like Stitch or Fivetran: how do these external datasets differ from those types of information that people are accustomed to working with, and what are some of the additional pieces of contextual data and metadata that you need to be able to propagate and provide for people as they're starting to integrate these into their analytical workflows?
[00:11:46] Unknown:
Internal data integration tools such as Stitch and Fivetran pull together internal data. If you're a bank, they pull together things like, when did I use the ATM, and what products do I have, and how much money do I have, and what are my expenses? We also help organizations use those same kinds of integration technologies to add in additional contextual information about the customer that comes from outside those 4 walls. So things like, okay, is there a third party way to verify that I actually have a job? Am I on a watch list or a sanction list, a government OFAC list, or am I on a fraud list of somebody who has red flag patterns of interacting with other institutions?
There's also, for marketing use cases, things such as demographic information, so segment based information on people's profiles that is similar to the world of demographic information that's used in marketing analytics. It's still integrated inside the enterprise, but it's more context about the customer that comes from outside of the 4 walls versus the interactions of the customer with the enterprise directly.
[00:12:57] Unknown:
And when you're talking about things like the demographic information and OFAC lists, and pulling in these external data sources to enrich the data that they might already have about their customers or to gather more contextual data about the environment that they're working in, that brings in a lot of considerations around privacy and regulatory concerns, and I'm wondering if you can talk to some of the ways that you need to manipulate and manage that data to stay within those compliance requirements and just some of the technological aspects of being able to ensure appropriate access controls and auditing for people who are using these types of data sources?
[00:13:33] Unknown:
It's a great question. Privacy, compliance, and also security are areas where people don't just wanna stay within the lines, they wanna go above and beyond. These enterprises are already ingesting a lot of these types of external data, but they're doing it through lots of different systems and processes. And they're all trying to find ways to not just reduce the risks and meet their obligations, but go further and treat external data and internal data, which at the end of the day is all customer data, with the same processes, to be able to understand that they have the rights to the data that they have, that they use it in the appropriate way, and that it's all very secure. So privacy, compliance, and security are very different things.
On the compliance and privacy side, 1 of the basic questions is where does the data come from and with what consent is the access provided? And so we work with data partners and we conduct diligence on them. We understand data provenance. We do the same sorts of checks on providers that banks and insurers and others already do. We have a dedicated team of people that diligence the vendor, and we have a certification process that we believe is pretty unique and allows our clients to depend on our diligence and our contracts with the suppliers. So provenance is 1 thing. People need to understand the ways in which enterprises and the suppliers, and we, manage GDPR, CCPA, and other equivalent regulatory frameworks around the world. We operate in quite a few different countries. So when consumers request visibility into where their data came from or request that their information be suppressed, there's consumer protection regulation that defines processes for how that needs to be handled. So we help implement systems and processes and technology with our clients and for our own purposes that allow that to happen very efficiently.
Security is another interesting thing because if it's a bank matching their own customer information to third party data, then they need to make sure that their own customer information is protected and doesn't leak. For example, if a social security number and email address and phone number are being pinged against the API of a niche data provider, a startup that has a very, very interesting and relevant pocket of information, the bank has to ensure that the data isn't being stored and isn't being transferred cross border. There are various different attack vectors for how that information might end up leaking, and banks are quite rightly very, very conservative about that. So there has to be diligence on the vendor's systems in order to protect against any risk there, and there are also privacy enhancing techniques that can allow that matching to happen without exposing the bank's information, or the other way around as well, where the supplier's information has a risk of being exposed through the bank.
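One common family of privacy enhancing techniques alluded to here is to match on keyed hashes of normalized identifiers instead of raw values, so neither side exposes plaintext PII. The sketch below is a simplified illustration of that general idea under assumed field names; it is not Demyst's or any bank's actual implementation, and production systems typically rely on vetted protocols such as private set intersection.

```python
import hashlib
import hmac


def normalize_email(email: str) -> str:
    return email.strip().lower()


def blind_identifier(value: str, shared_key: bytes) -> str:
    """HMAC the normalized identifier with a key agreed for this one matching job."""
    return hmac.new(shared_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Both parties hash their records with the same per-engagement key and compare
# hashes, so raw email addresses never cross the wire in plaintext.
key = b"per-engagement-secret"  # illustrative only; real keys are exchanged securely
bank_side = {blind_identifier(normalize_email(e), key) for e in ["Jane.Doe@example.com"]}
vendor_side = {blind_identifier(normalize_email(e), key) for e in ["jane.doe@example.com", "bob@example.com"]}
print(len(bank_side & vendor_side))  # 1 overlapping identity, without exposing the values
```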
There's no 1 silver bullet answer to things like privacy, compliance, and security,
[00:16:43] Unknown:
and what we do is a combination of a lot of different things and keeping the focus on it. 1 of the benefits of Demyst, though, is that we do it as a centralized platform, which means we get to reuse that across all of our customers and not just 1. On the note of things like GDPR and CCPA, with the whole right to be forgotten and the requirement to be able to delete customers' data everywhere that it might live: because you're collecting this data from these various external sources and providing it to customers who then might incorporate that information into various analyses, I'm interested in understanding the workflow and life cycle of being able to notify those customers that this record needs to be expunged from wherever you happen to use it, and just helping them to manage the lineage tracking and provenance of those records, whether it's to delete the record from their data warehouse or to understand that they need to rebuild their machine learning model because it incorporated that user's record, and just some of the ancillary concerns that go along with these requirements and these regulations that are being more proactive about managing customers' information.
[00:17:54] Unknown:
Yeah. And they're not just ancillary concerns in many situations. They're core to the appropriate use of data, and there's an entire industry of technologies that are emerging to solve this problem, but it's not a stand alone problem. We, like others, look at lots of different approaches, such as requiring customers to redownload data periodically and maintaining very rigorous transactional logs and ledgers to ensure we know what data went where and when. We have processes with certain vendors to make sure personal information is only used, and only temporarily, during the model training process.
So that there's protection when those requests come in. It's not that straightforward for enterprises because these requests are happening today, and individuals are coming to data suppliers and requesting that their data be purged. And that data is already cascading downstream into lots of different client systems and being copied and pasted and recopied and repasted and reused, and it does create this lineage challenge that people haven't fully solved for yet, and they haven't integrated that data into their lineage workflows. It's also worth noting that there are different requirements if it's a marketing and advertising use case versus a use case such as fraud or AML or financial crime, credit risk, or insurance underwriting within a regulated institution, where the institution has the consent from the direct customer.
So regardless of where the 3rd party also got your consent, if you go to a bank and you ask them to give you a credit card or a mortgage or something like that, you're, in many ways, granting consent directly to the enterprise to use data, and that is the most direct form of consent. That simplifies the journey here and means that it's easier for enterprises to work with data that isn't passed through lots of different systems. Back to the engineering side of this, the main thing that really helps here is logs and lineage and tracking where the data went in the most granular form, in a protected, secure way, and using that when requests come in and effectively flushing the cache when
[00:20:17] Unknown:
those requests come in. So digging more into the platform that you're building at Demyst, I'm wondering if you can talk through some of the technical architecture that you've built out and some of the processes that you manage to be able to collect and clean up these various data sources, manage the auditing and access controls, and understand the sort of usage patterns so that you can maybe decide this dataset isn't really valuable anymore, we're going to retire it, or, you know, this is useful, we need to collect other datasets that are akin to this, and just some of that overall process of the technical and operational architecture of your system.
[00:20:53] Unknown:
We have the upstream connectors into sources and systems and solutions that we build against a library of templates of upstream connector types that are tailored. We have a layer of schematization of the data, so when we're interacting with the system, we don't just, for example, code it as a string; it's coded explicitly as an address, or a street, or a city, or an email address, or the specific type. So at the most granular level, as we integrate the source, we're defining the types and integrating against a preconfigured library, and we have automation technologies to set those things up and integrate them. Then once it's integrated, we run a technology that runs known, accurate sample data, where we can, against all systems.
So we have a standardized file, think of it, for example, in the business domain, as a brick of businesses where we know the truth about the business. We know that it's actually a pizza shop and has this many employees and this much income and so on. And we systematically run that file constantly at low volumes against all providers, and we pay the providers for that. And that allows us to create a very objective set of descriptive statistics about the connectors. So we aggregate the metadata. Is it accurate? Is the match rate high? Are the data elements stable? Are they changing over time? And are they orthogonal for the business processes that are being optimized?
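As a rough illustration of the kind of descriptive statistics such a truth file makes possible, the sketch below compares one provider's responses against known-good records; the field names, record structure, and metrics are hypothetical simplifications.

```python
from statistics import mean


def connector_health(truth_records: list[dict], responses: list[dict]) -> dict:
    """Compute simple match-rate, accuracy, and fill-rate stats for one connector."""
    matched = [r for r in responses if r.get("matched")]
    match_rate = len(matched) / len(responses) if responses else 0.0

    truth_by_id = {t["record_id"]: t for t in truth_records}
    # Accuracy: how often a returned field agrees with the known truth.
    agreements = [
        r["employee_count"] == truth_by_id[r["record_id"]]["employee_count"]
        for r in matched
        if r.get("employee_count") is not None
    ]
    accuracy = mean(agreements) if agreements else None
    fill_rate = mean(1.0 if r.get("employee_count") is not None else 0.0 for r in matched) if matched else 0.0

    return {"match_rate": match_rate, "accuracy": accuracy, "fill_rate": fill_rate}
```

Re-running a small truth file on a schedule and trending numbers like these is what distinguishes "the connector is up" from "the connector is still returning the data we think it is."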
That allows us to do things such as manage our own SLA obligations. Again, not just is the connector live or down, but has it changed? Is it returning the right data? Is it stable? So that's the upstream connector side and the monitoring of that upstream connector side. We have a layer on top of that, which we call recipes. Recipes are combinations of datasets around common business problems and the business logic around that. Demyst has an infrastructure that we've built that executes a DSL for the creation and execution of recipes. A recipe might be something like: take this attribute from this source and combine it with this other attribute, compare them to each other, run a waterfall against a third attribute from a third source, add some logic on top, divide it by 2, multiply it by 10, maybe even execute a predictive model in there. That recipe DSL is configured within our platform and executes at runtime, so that the shaping of the data can happen as part of the real time interaction with the customer.
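To make the recipe idea concrete, here is a toy interpreter for a declarative recipe of the sort described: waterfall across sources, compare attributes, apply simple arithmetic. The operation names and structure are invented for illustration and are not Demyst's actual DSL.

```python
def run_recipe(recipe: list[dict], attributes: dict) -> dict:
    """Evaluate a tiny, hypothetical recipe: each step reads named inputs and writes one output."""
    values = dict(attributes)
    for step in recipe:
        op, out = step["op"], step["output"]
        inputs = [values.get(name) for name in step.get("inputs", [])]
        if op == "waterfall":      # first non-null input wins
            values[out] = next((v for v in inputs if v is not None), None)
        elif op == "multiply":
            values[out] = inputs[0] * step["factor"] if inputs[0] is not None else None
        elif op == "compare":
            values[out] = inputs[0] == inputs[1]
    return values


recipe = [
    {"op": "waterfall", "inputs": ["vendor_a.income", "vendor_b.income"], "output": "monthly_income"},
    {"op": "multiply", "inputs": ["monthly_income"], "factor": 12, "output": "annual_income"},
    {"op": "compare", "inputs": ["vendor_a.phone", "vendor_b.phone"], "output": "phones_agree"},
]
print(run_recipe(recipe, {"vendor_a.income": None, "vendor_b.income": 4200,
                          "vendor_a.phone": "+15551230000", "vendor_b.phone": "+15551230000"}))
```

Because the recipe is data rather than deployed code, it can be configured in the platform and evaluated at request time, which is what lets the data shaping happen inside a real-time customer interaction.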
And then finally, I already mentioned on the downstream customer side, we have a set of infrastructure and processes around the integration into downstream systems, and we run that through a microservice layer we call data APIs where we set up lots of different instances of data APIs that each customer configures and deploys into their own production workflows. So that's the end to end journey from source to use. And then we have quite a lot of infrastructure around logging and monitoring different uses. And that can be everything from the more complicated situations, such as managing data compliance, down to the more basic situations, such as centralized billing.
Enterprises want to know, in aggregate, what they've spent across all data sources, and to debug and reconcile those bills against their own actual usage. So we have billing and reporting. We have error logging: where transactions fail, people need to know why they failed and modify their recipes accordingly.
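A minimal sketch of the centralized billing idea: aggregate per-transaction usage by source and flag sources where the metered spend disagrees with the supplier's invoice. The field names, prices, and billing rules are made-up simplifications.

```python
from collections import defaultdict


def reconcile(usage_log: list[dict], invoices: dict) -> dict:
    """Sum metered spend per source and flag discrepancies against invoiced amounts."""
    spend = defaultdict(float)
    for txn in usage_log:
        if txn["status"] == "success":  # assume only successful lookups are billable
            spend[txn["source"]] += txn["unit_price"]
    return {
        source: {"metered": round(spend[source], 2), "invoiced": invoices.get(source, 0.0)}
        for source in set(spend) | set(invoices)
        if abs(spend[source] - invoices.get(source, 0.0)) > 0.01
    }


usage = [{"source": "registry_x", "status": "success", "unit_price": 0.05},
         {"source": "registry_x", "status": "error", "unit_price": 0.05}]
print(reconcile(usage, {"registry_x": 0.10}))  # flags a 5-cent gap to investigate
```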
[00:24:32] Unknown:
Is that answering your question, Tobias, in terms of the stack? Yeah, and I'm mostly interested in whether there are any particular off the shelf tools that you're using and if there are any sort of custom solutions that you've had to build in house to be able to work with this specific problem domain and manage your internal systems in an efficient and scalable manner?
[00:24:50] Unknown:
So most everything I just listed, we've built in house because it didn't exist. But we're heavy users of cloud infrastructure. We build a lot of things on the AWS stack. So when it comes to managing the execution of the logic layer that is codified into these recipes, for example, we use things like AWS Lambdas. And when it comes to storing data, we'll use things like Athena and Redshift. And when it comes to managing logs, we'll use various logging platforms on AWS. When it comes to integrating downstream systems, we integrate with clients' GCP and Snowflake warehouses and other hosted SaaS platforms.
In terms of specific integrated commercial data providers, it depends on the situation, but we do work with AI systems such as DataRobot or SparkBeyond. Sometimes we work with privacy enhancing technology systems such as Infer that provide some great capabilities that we embed with our clients. But most of it, Tobias, is developed in house through our infrastructure and technology team. In addition to using the infrastructure, there's a lot of automation technology around the infrastructure itself, infrastructure as code, which is a bit of a buzzword, but what that means in our context is: because we're running across so many different availability zones and so many different customers, and because we're handling a very sensitive resource, which is protected information and clients' information, and because we have quite a lot of processes that govern how we handle that data, we have lots of different technologies to spin up and shut down capacity on our infrastructure, to manage releases,
[00:26:34] Unknown:
to set up single tenant environments with clients. So we have a lot of hand rolled technology around the infrastructure side too. As far as the datasets that you're working with, I'm wondering if you can give a sense of the sort of average or median size in terms of whether it's gigabytes or, you know, terabytes and how the relative scale of dataset and data volume that you're working with will inform the types of interfaces that you're making available to end users for being able to interact with that data and some of the, you know, data gravity concerns and considerations that play into how you provide that as a service to your end users? Some of the most valuable datasets are, you know, hundreds of rows, not billions of rows.
[00:27:16] Unknown:
A lot of the connectors are transactional, so individual records, very small payloads, but SLAs matter a lot, and edge case scenarios matter a lot. But the actual size of the data itself is kilobytes, 20 features about 1 person. There are bulk data files that we work with, that we ingest, that we manage diffs on top of: hundreds of millions and billions and tens of billions of rows, sometimes hundreds of billions and sometimes more. These are gigabytes and terabytes. I don't know whether it goes above that. I presume it does, but I'm not sure how massive the datasets get. More often than not, as I mentioned at the start, they tend to be more short and fat rather than tall and skinny. It's not clickstream or log data that is 5 attributes, but trillions of rows that we have to process in a millisecond.
You know, we're working with clients that are dealing with millions of customers, not hundreds of millions of customers and not thousands of customers, so millions of customers, but they're pulling together thousands, tens of thousands, hundreds of thousands of columns about those customers, and they're having to have effective and efficient ways to process that as part of a transaction that might be sub second, not sub 10 milliseconds. These are big, chunky, important decisions. They're background screening a person. They're diligencing a business. They're, you know, running a mortgage application. And generally speaking, if it takes, you know, 1 second, 5 seconds, 30 seconds, that's very high performance in the eyes of the marketplace.
[00:28:50] Unknown:
As far as the access patterns, it sounds like it's largely query based where the end user is requesting a given record or a set of records, and they're not necessarily looking for a push based API of inform me whenever there's an update to this particular record or this particular user's information.
[00:29:07] Unknown:
There are also trigger based workflows like that where they're saying, inform me when something changes. And there's an access pattern as well where the data itself is egressed into clients' warehouses, and they do their own matching and their own trigger based monitoring. And the desire is for us to be responsible for pushing in changes to datasets and full file access to datasets as they come into our infrastructure. But the pull based pattern tends to be preferred because it meets enterprises' requirements and allows them to get the freshest, most compliant element of data at the time of interaction with the customer.
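As a hedged illustration of the two access patterns being contrasted, the pull pattern fetches fresh attributes at the moment a decision is made, while the trigger pattern reacts to change events pushed from the provider. The endpoint, payload, and event shape below are hypothetical placeholders, not Demyst's actual API.

```python
import json
import urllib.request


def enrich_at_decision_time(customer: dict, api_url: str, api_key: str) -> dict:
    """Pull pattern: request the freshest external attributes right when the decision happens."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps({"inputs": customer}).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def on_change_event(event: dict) -> None:
    """Trigger pattern: a monitoring feed says something changed; update the local copy."""
    print(f"record {event['record_id']} changed fields {event['changed_fields']}")
```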
[00:29:46] Unknown:
And as far as the work that you've been doing to build out Demyst and get it to the point that you're at now, I'm wondering what have been some of the most complex or difficult engineering challenges that you've dealt with.
[00:29:57] Unknown:
The most complex challenges have been around the data itself. The ecosystem as a whole is diverse and fragmented and exciting and messy; if it wasn't, we wouldn't be in business. But the actual data within each data partner is messy too. There are modeled and actual attributes. There are data elements that have lots of different unique definitions. It's not always obvious what the levels of the data represent, and it's not obvious how to do entity resolution across vendors and within vendors. So getting from a very rich but messy world of fragments and pockets of information into a single customer view that actually makes sense is, and in my view will be for a long time, 1 of the biggest and most interesting engineering challenges. People have been talking about that for as long as I've been working, and we'll keep talking about it. So there's a lot of engineering challenges around data cleansing and data linking and data resolution.
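A simplified illustration of the entity resolution problem just described: normalize business names and score whether two vendor records refer to the same entity. Real pipelines use far richer blocking, weighting, and review; the fields and weights here are arbitrary assumptions.

```python
import re
from difflib import SequenceMatcher

LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "pty", "co", "corp"}


def normalize_name(name: str) -> str:
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match_score(record_a: dict, record_b: dict) -> float:
    """Blend name similarity with postcode agreement; the 0.7/0.3 weights are arbitrary."""
    name_sim = SequenceMatcher(None, normalize_name(record_a["name"]),
                               normalize_name(record_b["name"])).ratio()
    postcode_match = 1.0 if record_a.get("postcode") == record_b.get("postcode") else 0.0
    return 0.7 * name_sim + 0.3 * postcode_match


a = {"name": "Demyst. Data Limited", "postcode": "10013"}
b = {"name": "DEMYST DATA", "postcode": "10013"}
print(match_score(a, b))  # close to 1.0, so the two vendor records likely describe one entity
```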
Some other interesting engineering challenges have been around, and I mentioned it already on the infrastructure side, configuration management and orchestration of the different interconnected systems that have to work in order to provide what is an enterprise grade system and set of promises to a bank, to an insurer, to another large enterprise, on the back of these upstream data vendors that sometimes just don't operate in that way and weren't necessarily intended to operate in that way when they originally built their own infrastructure. So for example, upstream sources in emerging markets that still don't have APIs, or if they do have APIs, they have significant downtime, or they provide batch data bricks, but they don't have matching technology to go with it.
So the provision of a set of enterprise capabilities on top of that fragmented ecosystem provides a significant engineering challenge.
[00:31:48] Unknown:
On the topic of engineering challenges, 1 of the perennial problems of dealing with end users and being able to manage data access is the ways in which they want to access it. So, you know, everybody has dealt with the issue of having to, you know, wonder whether or not the FTP file has been uploaded to the right location so that I can run my downstream job. And, you know, I noticed when I was looking through your documentation on your site that you provide a number of different APIs and file transfer methods, and I'm wondering if you can talk to some of the strategies that you've built to be able to manage these different interfaces for customers to be able to access the data in the way that is most convenient for them, while still being able to maintain your own sanity, providing access to all of these different datasets in various ways while still keeping the underlying data well managed and not having to deal with a lot of duplication?
[00:32:42] Unknown:
Yeah. There's still a lot of that sort of stuff that has to happen in the marketplace, and certainly Demyst is no different. You've gotta run checksums and various other things to make sure that you know you downloaded the right number of records versus what was expected, and what happens if there's a missed drop, and what are the downstream workflows that are affected? As I mentioned, we integrate with a lot of downstream workflows at enterprises. Maintaining our own sanity is important, but it's worth noting that enterprises have to manage this stuff themselves anyway. They're already accessing a lot of these datasets directly from source. It's just that they have duplication of their own systems across each enterprise as they integrate with their systems. And, you know, we do that too, but we do it once and we share the benefit of that across our customer base. So we have teams of people that get notified when things don't land from our sources, and we pick up the phone, and we call the vendor, and we talk to them, and there's not necessarily a magical technology solution to that. Sometimes something went wrong and we need to manually rewire things or we need to rerun the transaction, and that's okay because we only have our team doing that once as opposed to every customer doing that all the way across the ecosystem.
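A minimal sketch of the sort of file-drop validation being described: compare the delivered file's checksum and record count against what the vendor said to expect, and surface anything that disagrees. The manifest values are assumed inputs for illustration.

```python
import csv
import hashlib


def validate_drop(file_path: str, expected_sha256: str, expected_rows: int) -> list[str]:
    """Return a list of problems with a delivered bulk file (an empty list means it looks fine)."""
    problems = []

    sha = hashlib.sha256()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    if sha.hexdigest() != expected_sha256:
        problems.append("checksum mismatch: file may be truncated or corrupted")

    with open(file_path, newline="") as fh:
        rows = sum(1 for _ in csv.reader(fh)) - 1  # subtract the header row
    if rows != expected_rows:
        problems.append(f"expected {expected_rows} records, found {rows}")

    return problems  # in practice a non-empty list would page a team or open a ticket
```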
We have alerting and notifications and workflows that notify our teams and our client teams when things are not within expectations. I already mentioned that in the context of our monitoring around statistical metadata as well: not just whether things are working or not working, but whether the data is very different to what was expected, because people might have to retrain their downstream systems if things have changed materially. These systems and processes have to exist to make sure the data is flowing the way it should flow. They just shouldn't exist redundantly in every single enterprise and every single data provider. There's more and more platforms out there, not just Demyst, but, you know, Snowflake has a data exchange, and Amazon has 1, and there's lots of different great capabilities out there where people are centralizing how this stuff is managed, separate from the content itself.
[00:34:48] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box. As you mentioned, there are a number of different data exchanges that have been growing up, and I know that you've been at this for a while. And I'm curious what your thoughts are on the sort of vendor specific solutions, such as what Snowflake is doing to provide data as a shared asset or what BigQuery has been doing, versus what you're building, and some of the different priorities and use cases that would lead somebody to use something like Demyst, which is a very full featured platform, versus just using a dataset that somebody provides through the Snowflake data sharing capabilities?
[00:36:16] Unknown:
Yeah. And we partner with Snowflake and Amazon Data Exchange and others. Those are some great emerging capabilities that we use for our own data ingestion as well in some situations. And they solve a piece of the technology problem. They don't solve the compliance and licensing problem. You still have to enter a contract with the third party data source. And the data may or may not be there, and where it is, it's a simple mechanism to get data. Our clients work with those platforms too if what they want is just a well defined, curated, available bulk brick of data.
So it's a CSV in an S3 bucket, or it's a Snowflake table, or through Google, you can do dataset shares without redundant copy and paste. In Snowflake, you can do the same thing, and Redshift now as well. If it's a curated dataset that is just moving from point a to point b, that's where you don't necessarily need a platform like Demyst. If you already know what data brick you want, you don't need the help in discovery. You don't need the help in deployment into real time systems. As for the more full featured service providers like us, people would use us where they really need that last mile delivery and some of that recipe capability. You know, we pride ourselves on going end to end with the customer and actually getting things out of the lab and into production, into workflows with the right combination of data, typically under a single contract and with ongoing monitoring of that real time deployed use case, versus having to do all of that discovery, integration, and licensing directly on top of those exchanges that are out there.
[00:37:48] Unknown:
For working with these various external data sources, I'm wondering if you can just talk through the overall workflow of identifying what a new useful source might be and then going through onboarding it, integrating it into your systems, setting it up in the billing for customers to be able to find it and pay for it and just the end to end workflow from, you know, you deciding this is a useful data source or somebody requesting a particular data source to it being available in your marketplace?
[00:38:17] Unknown:
So data sources approach us, and also our clients have great ideas, great road maps. They engage data vendors all the time. They don't yet know whether there's a there there with a data vendor, you know, whether it's really predictive and useful and valuable or whether it's not. But 1 of the great conundrums in the marketplace is that in order to know whether there's a there there, they have to bear all of the cost and pain of the data engineering and the contracting and the diligence. And often, it will take upwards of 6 months to go through that pain, and that's before they know if they really wanna buy the dataset product. So what happens is people just tend to stay with the large incumbents that they already work with. It's too painful a marketplace. But in our situation, the clients will refer the vendor to us and say, oh, we work with Demyst. You know, why don't you go and get onboarded into their platform? And then we'll test it, and we'll try it. And if we like it, we'll use it, and the vendor's happy and the customer's happy. And we bear that cost and pain. Once we get past the discovery phase, we have the questionnaires that we've built into our own certification process.
We then diligence the company and review what they share with us. And then we run data tests to make sure the data is accurate against these truth files, then we integrate the connectors, then we make it available in our catalogs. Sometimes we make it available publicly. Sometimes it's private. Sometimes a client might have a proprietary relationship with 1 of their partners, which involves data sharing, first party data sharing. And so we'll go through the same sort of processes with the supplier, with the data source, but we'll publish it only to that client's own organizational configuration.
[00:39:58] Unknown:
As you have been working on Demyst and dealing with all these different data sources and all of the, you know, data cleanliness issues that I'm sure crop up, and being able to manipulate them and store them in a relatively uniform fashion and provide all of the supporting infrastructure to make those datasets usable, I'm wondering what are some of the most interesting or far reaching lessons that you've learned about data engineering as a discipline and the state of the industry as a whole?
[00:40:27] Unknown:
I wouldn't claim to be as much of an expert in data engineering as you and, I'm sure, many of your listeners. It really is the basics, things that people have known about for a long time but that haven't yet been applied in the world of external data. I can't tell you how many times there's been value created by something simple like just talking to the vendor, reading their documentation, and actually implementing their documentation the way it was intended, like putting a plus in front of the phone number for an international phone number or, you know, putting the right country code in as opposed to forgetting the country code. I mean, these aren't necessarily data engineering challenges or best practices. It's just being careful and rigorous and integrating things in the right way. Because what will often happen is, you know, a bank will work with a vendor. They'll throw the file over the fence, the vendor will throw it into their system, they'll throw the file back, but often people won't even actually look at the data, and then it lands back at the bank. They're like, the data's crap. No. The data's not crap. It's just that nobody had the bandwidth or focus to optimize the matching and the logic and do the basic stuff right.
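A small sketch of the kind of "boring basics" being described, normalizing phone numbers to an international format before sending them upstream; the dialing-code table and default-country assumption are illustrative only, and a production system would use a maintained library.

```python
import re

DIAL_CODES = {"US": "1", "AU": "61", "GB": "44"}  # illustrative subset


def to_e164(raw: str, default_country: str = "US") -> str:
    """Normalize a phone number to +<country code><number>, as many vendor APIs expect."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits          # country code already explicit
    digits = digits.lstrip("0")      # drop a trunk prefix such as the leading 0 in AU numbers
    return "+" + DIAL_CODES[default_country] + digits


print(to_e164("(555) 123-4567"))      # +15551234567
print(to_e164("0412 345 678", "AU"))  # +61412345678
```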
Other things: you get misspelled, fat fingered business names, for example. Like, you know, instead of Demyst Data Limited, it will be, you know, Demyst space Data with a dot in it or something like that. We'll do canonicalization of the input data, sometimes using third party data providers, sometimes using our own standardization engines that we've built into our platform. But getting the basics right and having clean input data means that the match rate is high quality. It's not magical data engineering. It's more just getting the inputs right and then conducting tests across the outputs and standardizing how that's done. And I think it's been surprising which business problems we've solved over the years as well.
Enterprises always wanna talk to us about the bright, shiny, innovative things they're trying to do with aerial imagery or footfall traffic or sentiment. Like, there's lots of great things out there, and they help. But the things that really move the needle for a lot of enterprises are, again, the basics. It's how do I know this is a real person? How do I know they have a job? How do I know this company exists? How do I do it at scale in a predictable way across lots of sources and lots of countries and lots of products? And what's most interesting for me is the boring stuff that keeps coming up with customers time and time again. It isn't new in the zoo in terms of business problems, but the maturation of the ecosystem and technology means that they can now solve it in a much, much bigger and better way for everyone's benefit.
[00:42:55] Unknown:
As far as the ways that your customers are applying the data that you're providing as a service, what are some of the most interesting or innovative or unexpected ways that you've seen it applied or some of the most interesting insights that you've seen people gather from the datasets that you provide?
[00:43:11] Unknown:
So some of that we can't share, of course, because it's proprietary to the customer and confidential. As for the insights that people come up with, people are usually looking to solve something that they already know and that they're already doing in some way when they get started. A banker is already collecting data in printed financial statements from a business, or a lender is already getting, you know, photocopies of passports and driver's licenses, or an insurer is already trying to find out what vehicle make, model, and year you have and how far your house is from the fire station. These things are all now accessible and pre fillable from the data ecosystem without friction from the customer.
It's, you know, not very insightful, Tobias. People already know that if you're close to a fire station or a fire hydrant and you own a house, that house is therefore less likely to burn down. It's not a new insight that that is the case. It's just a new method of getting the information. People already know that if a business has strong cash flow, it's gonna be a better credit risk. It's just that instead of getting it from audited P&Ls through PDFs and faxes and, you know, people talking to bankers, you can now get it through a single click and access to accounting systems. It's not a new insight. It's just a better way to get to the insight. I mean, our clients do do various really interesting, clever things. When they step back and they scan the ecosystem, they find correlations between different patterns, different spending patterns, or different weather patterns, and things that they haven't thought of before, whether it's an auto insurer figuring out that if you drive east and west to commute versus north and south, you're more likely to crash because you're driving into the sun at sunset and sunrise, which, you know, affects your insurance propensity, or on the fraud side, interesting insights that come out from how frequently data has changed and comparisons across datasets, which identify people that are creating fake profiles. There's some insights there. I mean, there's always those insights in credit and lending where they find uses for unstructured data, such as text in application fields, and look at the grammar and the way in which people write things, which in some countries can be used as part of an underwriting process. So lots of interesting things are happening, but I get, as you can tell, more interested by the boring stuff. In your own experience
[00:45:39] Unknown:
of building and growing the Demyst company and the platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:45:47] Unknown:
It's client centricity and tenacity, and those aren't just buzzwords you put on a poster in the office. Building any kind of enterprise technology business is hard and has lots of twists and turns, and building a team around that is hard. And at the end of the day, the true north around that is to listen to customers and be open with them and work hard; things will go up, things will go down. But if you put a technical team that doesn't have an ego with a customer that actually has a real problem, and then you get out of the way, then lots of great things can happen and smart people will build smart capabilities around delivering value. I've also learned over the years that it's really hard to stay focused, especially, you know, when you haven't raised the sorts of, you know, galactically enormous amounts of money that you read about in TechCrunch, where you have customers solving real problems, but they're not always directly down the line of what you thought your product strategy was going to be. But they're close enough that you build related technology and you end up proliferating into a lot of different use cases. And, you know, every 6, 12, 18 months, you've gotta pop your head above the parapet and refocus and refocus and refocus. And I've learned that that's a hard thing to do. And again, you know, if you have great people and good customers and they're solving real problems and you put them all in the room together and talk about those issues openly with them, that trumps all of the preplanning that you thought you knew in advance of building things.
You know, better ideas come from the marketplace than from a whiteboard. That's the main thing that I've learned over the years through engaging with the marketplace. And we're pretty excited about where we are and what's in store for the future. The market's definitely matured and enterprises are far more effective at capturing value from data. They're solving some of the biggest problems they've ever solved, especially in a post COVID world. Everything's digital, the ability to customize journeys with customers is much, much better than it used to be, the enterprise tech stack's more mature, and we're on the ground floor of people harnessing value from data, whether it's internal or external.
The future is bright around solving these problems across the entire stack.
[00:48:03] Unknown:
And for people who are interested in being able to incorporate external data sources into the analyses and machine learning workflows that they're building, what are some of the cases where Demyst is the wrong choice and they might be better served by either using something like the Snowflake data sharing or building out their own internal capacity for being able to collect and clean up these datasets?
[00:48:25] Unknown:
As I mentioned earlier, if it's a really plain vanilla brick of data that is well curated and well understood, then sometimes it's easier to just go directly to sources like that, especially if it's a very large dataset where enterprises don't wanna deal with the storage cost associated with that. The mechanisms offered by those cloud platforms can be a pretty efficient way to do things. There's a very important build versus buy question around data management. Some enterprises do see the way in which they work with external data as a strategic advantage. There might be a solution provider that's building fraud scores or there might be a bank that is accessing pockets of data that are so sensitive that they have a unique advantage.
In those situations, it might still be the right choice to work with us, but people just don't want to, because it's not something that they wanna outsource in any way, shape, or form. Those are the 2 areas. Oh, sorry, there's a 3rd area as well. Sometimes it genuinely is just a single source problem. People know what the source is. There is only 1 source. Integrating with it is not that painful. We're a retailer. You know, we think of ourselves sometimes as a supermarket. You come to the supermarket because you wanna buy lots of different fruits and vegetables, and you wanna make sure they're diligenced back to source and they're safe to eat, and you might wanna change what you buy every week. But if all you bought every week was bananas,
[00:49:48] Unknown:
and you bought a lot of them, then go to the farm. Don't go to the supermarket. It's cheaper. That's when it's better not to use us. It's a good metaphor. And I think 1 of the interesting challenges there is that it might start with 1 data source, and you say, okay, I'm just going to build it myself. And then down the road, you add a second data source. At some stage, you get to a tipping point where you actually are better served by buying something from a platform like Demyst, but you've already put in so much effort that you suffer from the sunk cost fallacy of, well, I've already built out all these other systems, and, you know, they're becoming a little unmanageable, but what's 1 more data source? And so there's always that challenge of, at what point do you decide that you've gone too far and you need to just throw it all away and go with the prepackaged option? There's an analogy to product development and engineering, which is it's
[00:50:37] Unknown:
always hard to get people to focus on refactoring anything. It's always the thing that you do later, but then when you do do it, you always breathe a sigh of relief. And it always ends up taking a lot less time than people thought it would, and creates a lot more benefit than people thought it would, but it's always hard to prioritize doing that housecleaning. So, yeah, that does happen. And you've pretty accurately summarized what some of our sales cycles look like. People take 6, 12, 18, 24 months sometimes of doing things 1 at a time directly with the marketplace.
And then, you know, some of them eventually save a lot, and some of them just keep chipping away, incrementally adding 1 more if-then statement, 1 more piece of gaffer tape, 1 more thing into their system. Now, regulation and compliance is a really interesting question that comes up here. If you keep adding just 1 more thing and 1 more element and 1 more stream and 1 more contract and 1 more systems integration directly to source, and you don't centralize it, at some point the chief data officer comes along, or even the regulator knocks on the door and talks to the chief data officer and says, where did you get all of this data? You know, Facebook is telling their customers where they got the data and giving them the right to manage it. So is Google. So is everyone else. You're a bank. Can you tell us where you got all of this third-party data and how you manage it, in 1 place with 1 set of reports?
And that will create the impetus to manage this stuff centrally.
[00:52:03] Unknown:
As you continue to build out the Demyst platform and work with your customers, what are some of the things that you have planned for the near to medium term, either in terms of new datasets or new interfaces or new platform capabilities or general projects that you're excited to get started with? We're very excited about some of the new technologies around
[00:52:22] Unknown:
consent management and privacy-enabling technologies that double down on what we already do, and what that means for further unlocking data that's in the ecosystem. We're also very excited about the last mile and self-serve capabilities. We're usually a fairly white-glove partner for customers for the first 1 or 2 use cases, but as they start to get good at working with data sources, customers do amazing things: they build great recipes that we'd never thought of and deploy them into systems that we hadn't thought of. We've recently been launching a lot of capabilities around that.
On the last mile side, we're building more verticalization into our solutions so that we can focus on key business problems that we know work. You know, welcome to my supermarket, here's the cereal aisle, but we know these 3 boxes work in this particular problem domain, so we can save you time. If you want different cereal, no problem, put it in the basket and we'll help you out. But we know these 3 work relative to each other, and we know they work in the context of this business problem. So we're excited about that, and we're excited about the emergence of enabling cloud capabilities like Databricks-style Delta Sharing and what that means for data ingestion, table sharing through platforms like Snowflake and Redshift and Google, and what that means for the ability to do 2-way and 1-way data enrichment and sharing and to build workflows around that. We're excited about a lot of the development in the technology ecosystem and, ultimately, about the broad vision we have at Demyst, which is that the ecosystem of external data is a vast, untapped resource. For all of the data that's out there, people use relatively little of it in production.
And it's not because the data doesn't exist; it's because there's friction to deploy it and friction to get it. Our average customer typically uses more than 10 times what our average new customer does. Even if we don't do anything that clever with the data, that speaks to how the market isn't as efficient as it could be. And when you take friction out of accessing something and doing something, whether it's data or AI or BI or anything else, people don't just do it more cheaply. They do it more often. So we're excited about unlocking access so that people can use more data and capture more value.
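As an aside for readers, here is a rough sketch of what consuming a shared table over the open Delta Sharing protocol mentioned above can look like with the open source delta-sharing Python connector. The profile file, share, schema, and table names are hypothetical placeholders rather than actual Demyst shares.

```python
# Illustrative sketch of reading an externally shared table over the open
# Delta Sharing protocol. The profile file and table coordinates below are
# hypothetical placeholders, not real Demyst shares.
import delta_sharing

# A ".share" profile file holds the provider's endpoint URL and bearer token.
profile = "external_provider.share"

# Discover which tables the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table directly into a pandas DataFrame for enrichment work.
table_url = f"{profile}#demo_share.reference.business_firmographics"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```

The resulting DataFrame could then feed an enrichment or scoring workflow without the consumer standing up their own ingestion pipeline for that provider.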
[00:54:43] Unknown:
Are there any other aspects of the work that you're doing at Demyst, or of the overall space of collecting, managing, and incorporating these external datasets for analytical purposes, that we didn't discuss yet that you'd like to cover before we close out the show? I just encourage folks, if they
[00:54:59] Unknown:
are in the ecosystem, if they're managing a direct integration with a single data provider and they're looking to consider different alternatives, take a look at demyst.com, sign up, play around, and provide some feedback. We have a consumption-based pricing model, so people can sign in, try it, and pay as they go if they want to. And we see it as a very, very big ecosystem and a very big marketplace that is unlocking a lot of value. So I just encourage folks to reach out and engage with us, whether they're a supplier, a consumer, or an individual who's just interested in the space and in talking about some of the things we covered on today's show.
[00:55:38] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in tooling or technology that's available for data management today. There's so many great solutions out there
[00:55:56] Unknown:
for different pieces of the data engineering stack, and we get pigeonholed into our little silo of external data, but I know the question you're asking is much broader than that. I think the picks and shovels underneath AI and BI and decision systems, for getting clean data, are still nowhere near as simple as they could be. When it comes to data prep and cleansing, I believe in my bones that at some point someone somewhere is going to come up with this magical system where you put messy data in and clean data comes out. I haven't seen that yet. I haven't seen the system that does the basics we all learned about in Computer Science 101 or Statistics 101: let's remove the outliers, let's take this numerical value and quantize it, let's apply a log or an exponent to this column, let's change m and f to male and female, let's get rid of the underscores and make it capitalized.
Let's delete this erroneous column that's always filled with nulls and is a total waste of time. All of those things that every data engineer on the planet still spends, in my view, a lot, if not the majority, of their time on. Systems should just do that for you, and we should all be spending our time thinking about the more interesting problems, like linking and matching and entity resolution, and everything from deployment through to maintenance. But that to me is a big gap. And look, if anybody on the show
[00:57:30] Unknown:
knows of that tool and wants to point me to it, I'd be a very happy customer. But I'm still waiting for the day when data prep has an easy button. Yeah. There are definitely plenty of vendors that would like to tell you that it exists, but eventually, you get to the point where the easy button stops working and you still have to do all those same things.
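As an aside for readers, the routine cleanup steps Mark lists above map onto just a few lines of pandas. The sketch below is illustrative only, assuming a hypothetical DataFrame with gender, segment, and income columns; it is not the magical do-everything system he is describing.

```python
# A minimal pandas sketch of the routine cleanup steps described above.
# The column names (gender, segment, income) are hypothetical.
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Drop columns that are entirely null and carry no signal.
    df = df.dropna(axis="columns", how="all")

    # Recode m/f into readable labels.
    df["gender"] = df["gender"].str.lower().map({"m": "Male", "f": "Female"})

    # Strip underscores and capitalize a messy categorical column.
    df["segment"] = df["segment"].str.replace("_", " ").str.title()

    # Clip numeric outliers to the 1st and 99th percentiles.
    low, high = df["income"].quantile([0.01, 0.99])
    df["income"] = df["income"].clip(low, high)

    # Apply a log transform, then quantize into 5 equal-frequency bands.
    df["log_income"] = np.log1p(df["income"])
    df["income_band"] = pd.qcut(df["log_income"], q=5, labels=False)

    return df
```

Calling `basic_clean(raw_df)` on such a frame returns a tidied copy with outliers clipped, codes expanded, and a quantized income band added.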
[00:57:47] Unknown:
Yeah. And then you throw out the tool, and you end up just jumping back into Python and writing it yourself. And maybe the answer isn't a vendor solution. Maybe it's an open source solution, and people are developing some great libraries in Python to do various subsets of this. But the world is still quite
[00:58:02] Unknown:
early in its development of these capabilities, and I'm sure lots of great things will come out that solve that problem over time. Well, thank you very much for taking the time today to join me and share the work that you're doing at Demyst Data. It's definitely a very interesting problem domain and an interesting set of capabilities that you've built out. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It's great to connect. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Mark Hookey: Introduction and Background
Overview of Demyst and Its Services
Industries and Use Cases for External Data
Types of External Data Sources
Privacy, Compliance, and Security in External Data
Technical Architecture of Demyst
Data Volume and Access Patterns
Engineering Challenges in External Data Management
Onboarding and Integrating New Data Sources
Customer Applications and Insights
Lessons Learned in Building Demyst
When to Use Demyst vs. Other Solutions
Future Plans and Exciting Developments
Closing Remarks and Contact Information