Summary
Dan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products, he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing what your current relationship to the data ecosystem is and the CliffsNotes version of how you ended up there?
- Dremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?
- You were instrumental in crafting the vision behind "querying data in place" (what came to be called federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach?
- How well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper?
- Following your work on Dremel you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. What do you see as the influence that those tools had on the evolution of the broader data ecosystem?
- How have your experiences at Google influenced your approach to platform and organizational design at SoFi?
- What’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house?
- How does your team solve for data quality and governance?
- What are the dominating factors that you consider when deciding on project/product priorities for your team?
- When you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there?
- You also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space?
- What are the most interesting, innovative, or unexpected data systems that you have encountered?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems?
- What are the areas that you are paying the most attention to?
- What interesting predictions do you have for the future of data systems and their applications?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- SoFi
- BigQuery
- Dremel
- Brigham Young University
- Empirical Software Engineering
- Map/Reduce
- Hadoop
- Sawzall
- VLDB Test Of Time Award Paper
- GFS
- Colossus
- Partitioned Hash Join
- Google BigTable
- HBase
- AWS Athena
- Snowflake
- Data Vault
- Star Schema
- Privacy Vault
- Homomorphic Encryption
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode
[00:01:46] Unknown:
today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and an early engineer on the Dremel project. So, Dan, can you start by introducing yourself?
[00:02:11] Unknown:
Yeah. Hi, Tobias. I am the vice president of data at SoFi. Currently, I've been here a little over a year working on building the data platform for SoFi. Prior to that, all my real professional experience was at Google. I worked at Google for 14 years, most of the time on Dremel and then later BigQuery when the teams were combined. So your audience is probably familiar with the Dremel paper. BigQuery is the externalization of Dremel into Google's cloud. Prior to that, I was doing my graduate work at Brigham Young University. And do you remember how you first got started working in the area of data? It was during my PhD program. Of course, back then in the early 2000s, we called it data mining.
And so I was studying empirical software engineering, that was my emphasis, in the area of mining software repositories. So what we were doing was scraping data from all of the artifacts of open source software that we could find, indexing it, and then trying to do interesting analyses of it. When I went to Google, I worked on ads optimization, which was, again, sort of a big data problem. We were trying to build keyword suggestions so that advertisers could get help in building out their advertising campaigns. After 2 years on that, I had the opportunity to join the Dremel team as an engineer.
[00:03:43] Unknown:
From there, it was all big data for me. Can you start by sharing a bit about what your current relationship is to the overall data ecosystem and maybe the CliffsNotes version of how you ended up there, which you gave us a little bit, but maybe a little bit more sort of detail about the different juncture points along the way? Yeah. At this point, I see myself as a consumer of the tools. That's my relationship to the ecosystem. I love all of the
[00:04:08] Unknown:
explosion of tools in the modern data stack. I think that we've come a long way in just the last decade. And so it's really exciting for me to be out here getting to use the tools and see how these things fit together. I think during my time on BigQuery, I had a very deep but somewhat narrow view of the world in terms of seeing query execution and data storage as the primary drivers of the problem. And really, now that I'm on the other side, I see that the exact optimal performance of a given query is not really my primary concern day to day.
The bigger problems are much more finding data, making sure data is reliable, monitoring SLAs about data delivery, answering business user questions. And so I've seen some other pieces of the data stack that I think there's some cool innovation happening, but we still have a lot of work to do.
[00:05:09] Unknown:
And now jumping back to sort of earlier in your career, as you said, you were 1 of the early engineers on the Dremel project, and that was, to my understanding, Google's next generation of data processing after their work on MapReduce and, you know, as Hadoop was starting to become widespread and mainstream in the ecosystem outside of Google, Google had already moved on from that paradigm and started working on Dremel, which has come to be more of the sort of agreed upon better approach. And I'm wondering if you can just share some of your perspective and context on that stage of the data ecosystem and maybe some of your experience of working on Dremel and the concepts that it was encapsulating as compared to the MapReduce paradigm that was starting to grow and be fostered outside of Google.
[00:05:57] Unknown:
Yeah. Absolutely. And, yes, I think you've framed that exactly right. The original engineer, Andrey Gubarev, who started the Dremel project inside Google, was trying to solve the problem of boilerplate code, long startup times, difficulty in chaining steps together, difficulty in writing the programs, all the things that we know about MapReduce, the initial ecosystem. At Google, there was a language called Sawzall, which some folks may have heard of. You could think of it as like parallelized Python that could compile down to MapReduces. And so he was trying to solve the problem of how could I just write a SQL query and how could I have it run really fast.
It was fairly rewarding. In 2020, we got to write the VLDB test of time award paper for Dremel. The original Dremel paper was published in 2010, and so in 2020 it won the test of time award. And when we went back and looked at that and tried to break it down, we broke the contributions that Dremel made to the industry down into sort of 5 categories. The first was bringing SQL back. People who are operating today, or have started since Dremel kind of brought it back, may not realize that in the late 2000s, early 2010s, it was believed that SQL was not a proper language for big data and it wasn't gonna work. And I think now we can see with Snowflake, with Redshift, with Athena, with BigQuery, that everybody's all in on SQL as the right language even for very large datasets.
The second thing for Dremel was disaggregation of compute and storage. The idea that I can scale 1 without scaling the other. And I would put in there along with that the horizontal scaling. The idea that I don't need ever larger computers to be able to handle more data, but I can have just a fleet of computers. A lot of that was following from the MapReduce work, obviously. The idea of in place data analysis that I would query data wherever it happened to reside, for us, that primarily meant in Google's distributed file system, initially GFS and then later Colossus.
And the idea of serverless, that the user running the query shouldn't have to worry about whether the servers were running or starting them up. They should just be there and be able to run the query. And then of course, columnar storage. That was 1 of the big innovations of the Dremel project. The idea that you could do columnar storage but still allow nested repeated data. For us inside Google, that meant protocol buffers; for much of the world now, that means JSON, but it's the same idea. The idea that I could still give you an efficient query over nested repeated data without you having to normalize everything.
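To make the nested columnar idea concrete, here is a minimal Python sketch of shredding nested, repeated records into per-field column vectors, so a query that touches one field only reads that column. It is an illustration only: the actual Dremel encoding described in the paper also stores repetition and definition levels so the nesting can be reconstructed losslessly, which is omitted here.

```python
# Minimal sketch of "column shredding" for nested, repeated records.
# The real Dremel encoding also tracks repetition/definition levels; this
# only shows why a column-at-a-time layout lets a query read a single field.
from collections import defaultdict

def shred(records, columns=None):
    """Flatten nested dicts/lists into {dotted.path: [values]} column vectors."""
    if columns is None:
        columns = defaultdict(list)
    for record in records:
        _shred_one(record, "", columns)
    return columns

def _shred_one(value, path, columns):
    if isinstance(value, dict):
        for key, child in value.items():
            _shred_one(child, f"{path}.{key}" if path else key, columns)
    elif isinstance(value, list):          # repeated field
        for child in value:
            _shred_one(child, path, columns)
    else:                                  # leaf value
        columns[path].append(value)

docs = [
    {"name": "doc1", "links": {"forward": [20, 40]}, "words": ["a", "b"]},
    {"name": "doc2", "links": {"forward": [80]}, "words": ["c"]},
]
cols = shred(docs)
# A query like SELECT COUNT(links.forward) only has to scan one column vector:
print(cols["links.forward"])   # [20, 40, 80]
print(cols["name"])            # ['doc1', 'doc2']
```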
[00:08:49] Unknown:
The other key aspect of Dremel was that it was still a divergence from the sort of then traditional data warehouse paradigm of everything being highly structured, very rigid, but, you know, easy to be able to answer a fixed set of questions and instead being able to support unstructured datasets, querying data where it lives, to some degree, breaking down the concept of data gravity as a blocking factor in what you can do with the information. And I'm wondering if you can just talk to what you see as being the broader impact on the ecosystem of allowing for that sort of data federation working with semi structured and somewhat unstructured datasets and being able to then, you know, allow for this new paradigm of extract load transform that was prior to that point largely intractable.
[00:09:39] Unknown:
Yes. I think that's true. I think there were a couple of things there. 1 thing that we always focused on with Dremel was making the queries fast. It was supposed to be an interactive system. The majority of queries we ran, ran in under a minute, many under 10 seconds. And it really changed the paradigm from MapReduce where I would code up my couple classes, my mapper and reducer, and then I would kick off a Borg job inside Google to start up however many instances of my objects I needed. And then I'd wait for my analysis to run and it might take a couple hours. And so I would go play a game of pool or something and come back later and see the answer. I should say, in the initial versions of Dremel, the first couple years of Dremel, you did have to prepare your data ahead of time and you had to load it on the local disk on the machines and you had to start up the servers. And so it was not great for adoption at that point. But at the point we moved to the Dremel service, where we had a standing tree of servers, because we did aggregation trees at that time. You'll see in more recent papers that we've changed the architecture there. But the idea that you didn't have to do anything, and as long as your data was sitting in the remote file system, you could issue a query and that query would execute immediately and get you the results back so quickly that you didn't even have time to write the next query before the answer came back.
It became really powerful. What I found is as people get access to tools and more data, it really only ever leads to more questions that they want to ask. And so there's always this inflection point. We saw it all the time when we were selling BigQuery into enterprises: there's just this inflection point in data consumption in these large organizations when the users get the ability to ask questions and get answers faster. So I think that's 1 of the big ways that it changed the dynamics.
[00:11:41] Unknown:
I'm not as up to speed with the architecture of Dremel as I am with systems such as Presto, which are, I guess, the spiritual successors at least of what you were building there. But I know that in order to be able to query across these different datasets, you know, whether they're living in S3 or the Hadoop file system or the Google file store, you need to have the metastore to understand, you know, what are the files on disk, what is the schema of those files, so that I can be able to structure the queries and be able to, you know, build the query plan to understand, okay, these are the files that I actually need to fetch, these are the operations I need to perform on them, etcetera. And I'm curious if Dremel has that same architectural component of needing to be able to schematize and sort of maintain that metadata ahead of time before you're actually able to execute those queries, or if you have a different system for being able to propagate that information to the query planner to be able to understand the actual sequence of operations that are necessary to be able to satisfy a given SQL structure?
[00:12:43] Unknown:
So today, the answer is yes, in what Dremel has become, BigQuery. BigQuery does have a metadata store. It does have managed storage, and it does use some of that information in query planning, less than what you would probably expect relative to other systems. But that's because in Dremel, we didn't have anything like that. We never built a metadata store. The files were self describing. And we had this interesting dynamic for a lot of years on the internal system, where much of the data we were querying, we queried exactly 1 time.
Meaning someone issued a query, we read that file for the very first time and the very last time we were ever gonna read that file. And that sets up a really interesting dynamic relative to the creation of this metadata and the computation of statistics that often get used in cost based optimization. If you are going to double the expense, right, from I'm only ever gonna read this file once to now I'm gonna read this file twice, you would have to bring some benefit that essentially completely removed the cost of reading the file at all the second time. Because the first time, when you're computing statistics, you're always gonna have to do a full table scan, right? And so I think that was 1 of Andrey's really clever insights and innovations when he started Dremel, just to say we're gonna make this thing performant without ever having to precompute anything.
And that led to a lot of years of really simple usage for the users when you could just point the tool at your dataset, not have to wait for any pre computation or load or anything and have the query still run at interactive speeds. Now, the way we did that, of course, we didn't have magic. The way we did that was we threw tons of resources at the problem. Right? So you just get a lot more CPUs, you get a lot more network. Really, I think the network was the key to why Dremel succeeded. I know some people have asked why was Dremel so successful inside of Google, but projects like, for example, Drill were not equally successful outside in the market.
I think 1 of the main reasons was the innovation on the networking inside Google. We did everything we could to saturate the network inside the data centers we ran in. 1 of the things that people outside the Dremel team didn't know was that we were never very good citizens in terms of resource usage. But it was okay because everybody was using Dremel. So they were all benefiting from us abusing every loophole we could find.
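As an aside on the "self-describing files" point above, here is a small illustration using Parquet as a stand-in, since Dremel's internal formats are not public (Parquet's nested columnar layout was itself inspired by the Dremel paper): the schema and row counts live in the file footer, so a reader can plan a scan without consulting any external metastore.

```python
# Illustration of a self-describing columnar file: the schema and statistics
# are read from the Parquet footer itself, not from a separate catalog.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 3.2, 7.8]})
pq.write_table(table, "events.parquet")

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)          # schema comes from the file footer
print(pf.metadata.num_rows)     # 3

# Reading a single column is cheap because the data is stored column-wise:
print(pq.read_table("events.parquet", columns=["amount"]))
```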
[00:15:19] Unknown:
The other aspect that Dremel has led to is this idea of data federation where you have this query engine that, as you said, is decoupled from the storage layer, but that also gives you the opportunity to make the query execution pluggable so you're not necessarily just working with files on disk. You might also be, you know, translating your SQL statement into a different statement for a different relational database or a non relational database and then being able to aggregate data across multiple different systems to be able to build analyses across them so that you don't necessarily have to do this extract and load process to be able to actually answer questions. You can just say, I'm gonna query the data where it lives, and I'm wondering what you see as the sort of broad impact that that has had on analytical capabilities, both in terms of what tools like Drill and Dremel and Presto are able to do, but also in terms of the ways that we think about how to build data systems and how to think about the contracts between data producers and owners and the downstream consumers of that data?
[00:16:25] Unknown:
Great question. Lots of stuff to dig into there. So 1 thing I would say is, in addition to just being able to query the federated data, I think the revolution that came about because of us having done that in Dremel, at least inside Google, was that suddenly people expected to be able to join datasets even if the data didn't reside in the same system. So prior to that, things like the initial MapReduce, of course, didn't join anything. You had 1 dataset and you ran it. You did a map and a reduce and you produced an output, and unioning things even was hard. But joining was certainly hard. So in Dremel, because we launched join, and in order to be able to join we had to be able to do it on the fly, we called it shuffle. That's a repartitioning operation so that you can repartition the data to get the right join keys. And, you know, our primary join strategy was always partitioned hash join. We tried a couple of others. We tried lookup join. Certainly, we do have functionality for broadcast join if 1 of the datasets is small, and we did a lot of work to push small up to, like, the size of a gigabyte.
But there were always limitations. So our primary mode was the shuffle hash join. And once you were able to read federated data and you could still do a join with an on the fly partition in memory and that was your core strategy, it opened up the ability to join data from wherever it happened to be. So for us, we used Bigtable, obviously, which HBase is the open source version of. But when you had a Bigtable that you could join to a file in Colossus, that you could join to a query result from a MySQL database or an F1 query or something like that, it really changed people's expectation. And getting back to the idea that data just engenders more questions, it really opened up people's eyes to the possibilities: look at all these interesting analyses I could do that now require no work from me to preprocess the data or do a multi phase orchestration with all sorts of different transformations. I can just express it as a common SQL statement using the join operator I'm used to. And under the hood, Dremel takes care of scheduling all of that, building the query plan.
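For readers who want to see the shape of the shuffle plus partitioned hash join strategy described here, the following is a simplified, single-process sketch. In Dremel or BigQuery the shuffle repartitions rows across many workers over the network; here each partition is just an in-memory list, but the core idea, that rows with equal join keys hash to the same partition and each partition can be joined independently, is the same.

```python
# Simplified, single-process sketch of a shuffle + partitioned hash join.
NUM_PARTITIONS = 4

def shuffle(rows, key):
    """Repartition rows by a hash of the join key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[hash(row[key]) % NUM_PARTITIONS].append(row)
    return partitions

def partitioned_hash_join(left, right, key):
    left_parts = shuffle(left, key)
    right_parts = shuffle(right, key)
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        # Build a hash table on one side of the partition...
        table = {}
        for row in lpart:
            table.setdefault(row[key], []).append(row)
        # ...and probe it with the other side.
        for row in rpart:
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Grace"}]
events = [{"user_id": 1, "event": "login"}, {"user_id": 1, "event": "click"}]
print(partitioned_hash_join(users, events, "user_id"))
```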
So I think that was 1 way in which it changed the paradigm where people were able to expect to join everything. And then it went to the next level when we rolled out BigQuery and now you see the same thing evolving with Snowflake's data marketplace. But the idea that the entire world's data analysis system could be 1 global system where all the data was joinable from its original place at rest. So there was no need to copy or get stale redundant versions of your data anywhere. I could just share my table with you and you would be able to query it.
The second part though of your question, because you asked the question then what did that do to the relationship between data producers and consumers? I think it actually sort of broke the relationship, and it's 1 of the things we're struggling with now at SoFi, figuring out how to rein that back in. And I didn't always see it when I was inside Google. I think in the early days, you could accurately describe what we built with Dremel as a data lake, or at least an early version of a data lake. And then when we started to roll out BigQuery, we started by trying to keep that same data lake paradigm. And we discovered after a few years that it just wasn't working for organizations, and we had to become much more like a data warehouse in BigQuery, moving away from the data lake paradigm.
And I'll explain what I mean by that. In a data lake, I think the governance and the agreements between data producers and data consumers are hard to maintain. It lowers the barrier to entry for me as a consumer to just be able to query any data at rest as long as I can get access to that data. But it also leads to me potentially accidentally taking dependencies on things that the data producer has no interest in guaranteeing. And so schemas can change out from under me, data can disappear, it can be low quality or unreliable data, and I have no way of knowing that. And so I think there is now a move back toward the data warehouse paradigm with things like Snowflake and BigQuery, trying to keep all the advantages of federated data and all of that, but making it clear which data I actually need to be reliable, because it's going into my regulatory filings or it's being shown back to my end users or something like that. For that data, I do think we're going to see a big move back to wanting it to be controlled and monitored and reliable.
All of the things that I used to get from the really strict ETL pipelines owned by the central data team with guarantees about reliability and freshness and all of that.
[00:21:46] Unknown:
With that shift from the sort of data lake paradigm to where we are with data warehouses, and now the up and coming term being the data lakehouse, where it's sort of this hybrid of, I can dump all the data that I want, but it has to be at least semi structured and cleaned before I'm gonna bother querying it and exposing it to other people. I'm wondering what your view is as somebody who has helped to build a lot of the foundational tools and concepts that led to these capabilities. Now being a consumer of them, how does that impact your sort of design sense and how you think about the appropriate way to build data platforms at scale in a way that is sort of flexible and agile, but structured enough to be able to actually have reliable outputs and high quality data?
[00:22:29] Unknown:
That's exactly the right place to go with this. And I would throw 1 more thing into the mix. I know it's a bit of a fraught topic. So I'll just say the data mesh idea, regardless of whether you take the canonical definition or the more loose definition as I prefer, but this idea of distributed data ownership, that's really what we're pursuing at SoFi, and I think a lot of my peers and other organizations are pursuing as well because they're recognizing that trying to staff up a central data engineering team that owns all of the phases of data from production all the way to consumption isn't going to work.
We are, I think, a little bit like most organizations, kind of fumbling in the dark. We're trying to figure out what works for us and how we can guarantee some of the things we need. Being in financial technology and also having just gotten a bank charter, so we're not just, like, playing in fintech, but we're an actual bank regulated by the OCC and other organizations, we have some heavier audit requirements and regulatory requirements than I think we were used to in the past. And so that's leading us to try and find ways to still be able to distribute, but still provide guarantees for certain datasets.
The organizational structure that we've landed on is to have our engineering teams responsible for the production and ingestion of their data all the way into what we call the cleansed area in our data warehouse. We conceptualize our data warehouse in 4 zones. The first is raw. That's where the ingested data lands and there are really no guarantees. You can think of this like the data lake component. It's private by default. The data may be schemaless, meaning they're just dumping JSON blobs into variant types in Snowflake and then cleaning it up as the first phase in their ELT process. But our goal is for all analytic data, or all potentially interesting analytic data, to land in the raw zone in Snowflake.
So we are putting all of the data inside Snowflake in the raw zone. This is similar in principle to what other folks are doing by landing in S3 first and then transforming from there, using either Snowflake's ability to query directly from S3 or using a query engine like Athena, something like that. For us, I wanna shut down the multiple tools, multiple sources of truth problem, and so we are putting it all inside Snowflake, inside private schemas so that people cannot accidentally be getting to each other's raw data. The second zone we call cleansed. Cleansed is where you impose a schema.
You do whatever cleaning is necessary, deduplication, standardization of data types, data content, probably some introduction of synthetic join keys to support the next phase in the process. And the cleansed layer is where we expect there to be a contract between the data producers and the data consumers. Meaning, if I put a field in there, I'm not gonna just pull that field away without some automated testing being able to catch me. For example, we don't give any group direct DDL or DML access to their cleansed schema. The only way you can update the schemas or the data inside your cleansed schema is via some automated process, which requires GitLab and our CI/CD pipelines to run.
And so your consumers always have the ability to detect if you're changing something. The next zone for us on top of cleansed, we call modeled. Pretty self-explanatory, it's where you build your data models. There will be some amount of join and aggregation there. We'll have all sorts of flavors in there, from star or snowflake schemas, fact and dimension tables, to just flat broad tables. And we're still doing it in a distributed way there, where we do have a central core data warehouse, which is the most important piece. But then around it, we have what are called team data marts, where individual groups for the different vertical business units we have can be building their own data models. And then the last zone we have on top of all of that we call summarized.
That's where you build your reporting layer. So those should be the tables that are built for optimizing some report. In general, we think the pipeline will flow where data science can prototype reports directly off the modeled schema. But then if they find the SQL query for the report they want, they'll probably turn that into the equivalent of a materialized view or base tables with aggregates or something that allows them to do very simple reporting. And we'd like to pull most of the business logic out of our business intelligence tool, which is Tableau, and keep it all in the data warehouse.
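As a compact summary of the four zones just described, here is an illustrative sketch. The zone names come from the conversation; the exact flags, and which zones beyond cleansed gate DDL through CI/CD, are assumptions for illustration rather than SoFi's actual configuration.

```python
# Illustrative summary of the 4 warehouse zones described above. The flags are
# assumptions (the conversation only states the CI/CD-gated DDL rule explicitly
# for cleansed), not SoFi's actual configuration.
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    purpose: str
    private_by_default: bool   # raw schemas are private so teams can't read each other's raw data
    contract: bool             # consumers may rely on fields not changing underneath them
    ddl_via_ci_only: bool      # structure changes must flow through GitLab CI/CD

ZONES = [
    Zone("raw", "ingested data lands as-is, e.g. JSON blobs in variant columns",
         private_by_default=True, contract=False, ddl_via_ci_only=False),
    Zone("cleansed", "schema imposed, deduplicated, typed, synthetic join keys added",
         private_by_default=False, contract=True, ddl_via_ci_only=True),
    Zone("modeled", "core warehouse plus team data marts: facts, dimensions, wide tables",
         private_by_default=False, contract=True, ddl_via_ci_only=True),   # assumption
    Zone("summarized", "report-optimized tables and materialized views for BI",
         private_by_default=False, contract=True, ddl_via_ci_only=True),   # assumption
]

for z in ZONES:
    print(f"{z.name:<11} contract={z.contract!s:<5} ddl_via_ci_only={z.ddl_via_ci_only}")
```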
[00:27:49] Unknown:
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With SelectStar's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan.
[00:28:35] Unknown:
The modeling question is something that has been interesting to me for a while now because of the fact that since we do have these very powerful and flexible engines that allow us to be able to query across structured and unstructured datasets, and we have the power to be able to just kind of throw everything into a giant table and be able to run aggregates across it, we don't necessarily have to be as rigorous as we used to be when we were all relying on a, you know, a massive Oracle server and worrying about the licensing there, or a, you know, Microsoft SQL Server, and having to do these, you know, star and snowflake schemas, or, you know, even data vaults. And I'm wondering what you have seen as the impact of these newer generations of technologies on the ways that people think about data modeling and the sort of rigor that they place in terms of how they're actually structuring their schemas versus just saying, okay, well, this is all in a table, I can run a select statement that gets me what I want, so, you know, good enough, I'm off to my next task.
[00:29:31] Unknown:
I think it's been a really good thing in terms of getting more analysis inside the data warehouse. As you say, now that we're not constrained by the resources of running on a single Oracle or Postgres box or whatever, and having to lock people out during certain times so core processes could run, that piece of it has been fantastic. What I'm encountering since I came over to SoFi, which actually never occurred to me when I was building the tools at Google, and it probably should've but it just didn't click until I was the 1 doing it, is that there were other advantages to the old model. We looked just at the cost and we said, well, it was about resource constraints, and now I don't have the resource constraint, so now I don't have to do any of that other stuff anymore.
1 of the problems with that: take our model, for example. If we were to tell everybody we only have 2 zones, we have raw and cleansed, then because we use Snowflake and Snowflake is so fast and we have so many resources, everybody's just gonna query directly against the cleansed data, which is still largely in a normalized schema. And then you're gonna produce your reports directly from there. What you get then is a proliferation of business logic into everybody's report and dashboard. And the bigger your organization gets and the more distributed, the higher the likelihood that there are gonna be disagreements.
And we see this in spades at SoFi right now. Executives getting 3 different reports that all claim to be talking about the same thing but showing different numbers. Because either they've used different filters and so their date ranges are slightly different, or their criteria for the cohorts that they're including are slightly different, or they've joined in a different order, or they've used a left outer join instead of a right outer join, or they did an inner join where the other guy did a right outer join. And so we've got all these different layers at which the reliability of the data can break down.
And so I think right now what we're seeing is actually a pushback in the other direction. We went in 1 direction because we now had all this new flexibility, because we had Snowflake and we had these tools that could do things really quickly. And now what we're seeing is we need to rein that in, and we need to find a way to keep the goodness of being able to run a ton of analysis and not having data engineering be the bottleneck of the organization, but also having some sort of gatekeeping and guardrails that prevent people from inadvertently making mistakes.
[00:32:14] Unknown:
Well, clearly, the solution is that we just build a new category of tools and say that that's gonna solve all of our problems. Right? That's what the metric layer is for. That's right. Yes. The metrics store is definitely what we're looking at, and I would throw 1 other in there, which is the data quality tools.
[00:32:28] Unknown:
That's for us sort of the most recent piece we've brought into our system. And I think that getting more rigorous about testing and monitoring data quality, so that our end consumers are not our canaries in the coal mine, is very much akin to the transformation we went through as an industry over the last couple decades toward expecting engineers to always write unit tests on everything they were running and expecting those tests to be run. I don't know if other companies are quite the same as Google. I know mostly about Google, but at Google, the system we have runs every affected test at every single changelist. And so you always know if you broke something, and it does it before you submit. You're not even allowed to submit code that would break tests. I would love to see us at SoFi get to a point with these data observability tools where people were not unknowingly pushing schema changes that broke their downstream consumers or changing data values that invalidated some report.
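To show what "unit tests for data" can look like in practice, here is a minimal, generic sketch of declarative expectations run against a table before a pipeline publishes it. The checks, thresholds, and table names are illustrative and not tied to any particular data observability vendor or to SoFi's setup.

```python
# Minimal sketch of data quality checks run in CI/CD so that downstream
# consumers are not the canaries. Table, column names, and thresholds are
# illustrative only.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany("INSERT INTO loans VALUES (?, ?, ?)",
                 [(1, 1000.0, datetime.utcnow().isoformat()),
                  (2, 2500.0, datetime.utcnow().isoformat())])

def expect_min_row_count(conn, table, minimum):
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n >= minimum, f"{table}: {n} rows (expected >= {minimum})"

def expect_no_nulls(conn, table, column):
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()
    return n == 0, f"{table}.{column}: {n} null values"

def expect_fresh(conn, table, ts_column, max_age):
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    age = datetime.utcnow() - datetime.fromisoformat(latest)
    return age <= max_age, f"{table}: newest row is {age} old"

checks = [
    expect_min_row_count(conn, "loans", 1),
    expect_no_nulls(conn, "loans", "amount"),
    expect_fresh(conn, "loans", "loaded_at", timedelta(hours=24)),
]
failures = [msg for ok, msg in checks if not ok]
if failures:
    raise SystemExit("Data quality checks failed:\n" + "\n".join(failures))
print("All data quality checks passed")
```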
[00:33:39] Unknown:
Jumping back a little bit to the question of data federation, I'm also interested in your take on how you see that impacting the sort of questions of data governance and the sort of formulation and application of ethical priorities on how that data is being used and consumed and remixed, and some of the ways that you're able to guard against or account for bias in the ways that you're building your analyses because of the fact that more people have more access to more data?
[00:34:10] Unknown:
I think there's a lot in there. So federated data, I think that has been 1 of the downsides of it, that you lose some ability to govern. In some senses, that's okay because not all data has to be so tightly governed and controlled. You want there to be flexibility and experimentation. But I think there are places where it runs into problems. I think it's particularly hard to govern and control PII in a data lake environment, particularly if that PII is sometimes stored in schemaless or schema-on-read fields. So I'm saying here, for example, I have seen situations where PII is being stored inside JSON blobs, inside files written to S3.
That becomes a real danger for governance and for compliance, because if it's stored at least in a structured environment, what I can do is read the first few rows of a dataset and identify it looks like the values in this field contain a social security number or address or something. But if I've got JSON blobs, which do not have strictly defined schemas, I can't read just part of that data. I have to read all of that data and I have to parse every JSON blob and look through every field to say, do any of these look like PII? That becomes a cost-prohibitive expense, I think, for many organizations.
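The cost asymmetry described here can be sketched in a few lines: typed columns can be sampled a handful of values at a time, while free-form JSON blobs must be fully parsed and walked. The SSN regex and field names below are illustrative only.

```python
# Sketch of PII detection cost: sampling typed columns versus walking every
# field of every JSON blob. Patterns and field names are illustrative.
import json
import re

SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def flag_pii_in_columns(columns, sample_size=5):
    """columns: {name: [values]}; inspect only a small sample per column."""
    flagged = set()
    for name, values in columns.items():
        if any(isinstance(v, str) and SSN_RE.match(v) for v in values[:sample_size]):
            flagged.add(name)
    return flagged

def flag_pii_in_json_blobs(blobs):
    """blobs: iterable of JSON strings; must parse and walk every record."""
    flagged = set()
    for blob in blobs:
        stack = [("", json.loads(blob))]
        while stack:
            path, node = stack.pop()
            if isinstance(node, dict):
                stack.extend((f"{path}.{k}" if path else k, v) for k, v in node.items())
            elif isinstance(node, list):
                stack.extend((path, v) for v in node)
            elif isinstance(node, str) and SSN_RE.match(node):
                flagged.add(path)
    return flagged

print(flag_pii_in_columns({"ssn": ["123-45-6789", "987-65-4321"], "city": ["SF"]}))
print(flag_pii_in_json_blobs(['{"applicant": {"ssn": "123-45-6789"}}']))
```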
So, you asked about ethical considerations. 1 of the important things we have to think about is, as we take advantage of these new tools that have huge scale and the ability to give us greater flexibility, we have to think about whether that flexibility is being applied to data which is safe or data which could potentially leak. I'll throw 1 more on that if I can. It's a little divergence from the question you asked, but getting back to this idea of joins, I think that has been the real innovation in the last decade since Dremel. The idea that I can join data from all sorts of different datasets.
And I think historically, when people think about data breaches or misuses of their data, they think primarily about leaks, about exposing. This is like the Equifax thing. Right? Someone's gonna get a hold of my 140,000,000 records, and then they'll know all the data that I have. But I think what we're seeing now is the actual bigger danger is not that they get the data that you had, but that they get your data and they join it to some other dataset that you hadn't even considered. We've seen leaks like this, for example, right? The idea that American military personnel wearing fitness trackers that were accidentally still hooked up to Strava, and continuing to do their exercise regimen even though they're deployed to secret bases around the world, suddenly draw a map on a publicly accessible map that says, hey, there is something going on in this region of pick your favorite foreign country that you don't want adversaries knowing about military bases in. Or the New York taxi data that was released, and then somebody found, hey, I can join this data with paparazzi photos and I can find out where famous people are taking taxi rides to and from and where they tend to be at a given date and time.
So I think we need to get in the mindset not just of saying I secure my own data but what is the absolute worst thing someone could do if they had my data and random dataset x that they could join it to? Yeah. And that's giving rise to another sort of subindustry
[00:38:07] Unknown:
of this idea of privacy engineering, where before you actually expose or share any dataset, whether internally or externally, you actually go through and obfuscate the data or, you know, mask it or, you know, add in some random noise to make sure that you can guard, with reasonable expectation, against these re-identification attacks or, you know, these data joining attacks where you say, you know, this 1 piece of information is innocuous in isolation, but when I join that with, you know, US census data, actually, I can see exactly who this person is.
[00:38:42] Unknown:
Yep. Yep. And there's another thing. So you mentioned the metrics store. That's definitely a piece of technology we're looking at for SoFi's data platform and actively trying to figure out what we're doing there. Another 1 is this idea of a privacy vault that some people are talking about. And I think there are some really innovative approaches there. What we're thinking about is we call it shifting left. We would like to be able to shift the obfuscation of PII as close to the point of production as possible. And the way you could do that is, I think 10 years ago, the hope would have been for homomorphic encryption.
We tried; we implemented a couple of homomorphic operators in Dremel and in BigQuery, but it just hasn't panned out yet. I'm not smart enough on the math to know if it ever will. I don't think that fundamentally that's gonna be our saving grace. And we may put a note in the show notes if folks are not familiar with the idea of homomorphic encryption. But essentially it is, I can encrypt 2 values and then I can do operations on the encrypted values, which give me the encrypted version of the result I would have gotten from doing those operations on the unencrypted values. So say I take 2 numbers, I encrypt them, I add them together. The answer is the encrypted version of the sum of those 2 numbers. But there's no way for me to back out at any point from the encryption and get back to the original values.
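For readers who want to see the property Dan describes in action, here is a toy demonstration using textbook Paillier encryption with deliberately tiny, insecure parameters. It is an illustration of the idea only, not what Dremel or BigQuery implemented: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, and only the private key holder can decrypt.

```python
# Toy additively homomorphic encryption (textbook Paillier, tiny insecure keys).
# Requires Python 3.9+ for math.lcm and pow(x, -1, n).
import math
import random

p, q = 1789, 2003                     # toy primes; real keys are thousands of bits
n, n_sq, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

a, b = 1200, 345
ca, cb = encrypt(a), encrypt(b)
c_sum = (ca * cb) % n_sq              # multiplying ciphertexts adds the plaintexts
assert decrypt(c_sum) == a + b
print(decrypt(c_sum))                 # 1545, recoverable only with the private key
```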
So if homomorphic encryption can't save us from this, I think the next best thing is tokenization where I can get a deterministic token for my value. I can store that in a secure place and only pass the token around within my ecosystem. And if the tokens are deterministic, I can at least do equality comparisons between them. So this now gives me join and aggregation without having to decrypt. And if I need to do other operations, I can have differential privacy in the vault at the point of access. So now I don't have to worry about who gets access to my token. I can hand the token out willy nilly and when they go try to exchange it with the privacy vault, the privacy vault can be the 1 that decides do you get all the data? Do you get masked data? Do you get no data? Do you get access just to the privacy vault doing comparisons for you? This would be the model where I say, I know a token.
I know another thing. I want you to tell me if these 2 are the same thing, and it can give you yes or no answers, but it can't give you back the original data.
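A minimal sketch of the deterministic tokenization idea, with hypothetical class and method names: the vault holds the secret key and the token-to-value mapping, everything outside the vault sees only tokens, and because tokenization is deterministic, equality comparisons and joins still work on the tokens themselves.

```python
# Minimal privacy-vault sketch: deterministic tokens preserve equality (so
# joins work) while access policy is enforced at the vault, not by whoever
# happens to hold a token. Names are hypothetical, for illustration only.
import hmac
import hashlib

class PrivacyVault:
    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._store = {}                      # token -> original value, vault-only

    def tokenize(self, value: str) -> str:
        token = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()
        self._store[token] = value
        return token

    def detokenize(self, token: str, caller_is_authorized: bool) -> str:
        # The vault decides who sees raw PII, masked data, or nothing.
        if not caller_is_authorized:
            raise PermissionError("caller may not see raw PII")
        return self._store[token]

vault = PrivacyVault(secret_key=b"vault-secret")

# Two systems tokenize the same SSN independently and get the same token,
# so their datasets can be joined on the token without exposing the value.
t1 = vault.tokenize("123-45-6789")
t2 = vault.tokenize("123-45-6789")
assert t1 == t2
print(vault.detokenize(t1, caller_is_authorized=True))
```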
[00:41:21] Unknown:
Yeah. I've been reading about that a little bit recently myself as well. I've had somebody contact me with the potential of being on the show to talk about that idea as well, so definitely spending some focus there. And so now digging a bit more into your experience building at SoFi, you've mentioned a bit about some of the stack that you've built up, some of the things that you're looking towards. And I'm wondering if you can just talk to how your experiences working at Google on building Dremel, helping to grow BigQuery, building some of the broader ecosystem of data analysis and data tooling for the Google Cloud Platform, how that has influenced the way that you think about platform design, organizational structure, and sort of the prioritization of features and projects and the valuation of data at SoFi, and some of the ways that that earlier experience has led into where you are right now and the ways that you think about things. Yeah. There's 1 more thing about the time at Google that I would add that I think has probably had the biggest impact on how I'm thinking about things at SoFi.
[00:42:27] Unknown:
So shortly after we launched the Dremel service back in 2010, we began on a project which we eventually named Plex inside Google. It was p l x because, I guess, vowels are expensive or something. Folks may have read about the Plex system; there are some papers we published a little bit externally about it. But the idea was to build a unified data analytics platform. And even back then, we had identified a number of components like a data catalog, like data observability tools, a robust ad hoc query UI, dashboarding tools, all the things that we would need to build. And of course, being Google and being that it was early in the evolution of these things, we built everything ourselves.
And some pieces succeeded more than others, and some took a long time to get to various points. I would say Plex is now very successful inside Google. And the reason I bring that up, I think I have taken 2 lessons away from that experience that have shaped the way I'm thinking about it at SoFi. 1 is I think we did a really good job of identifying the important components on the Plex project. And when I came over to SoFi and as we have been building, it has very much shaped my view of what's missing. And so as I've come in, I've identified gaps along the way. When I first got here, there were 4 specific things I identified, gaps in our offering that we needed to fill in. We've since added a couple more. But that idea of thinking about it in terms of component systems and how do they fit together, and how do I build a complete solution out of these component systems?
I think that was 1 thing. But then on the other side, the negative example I took away from our Plex experience in shaping the way we approach things at SoFi is it was extraordinarily expensive to try and build all that stuff ourselves. Now, if you're Google, you can afford that. I mean, at Google, you can even afford to build multiple redundant systems and compete with yourself in the internal market, which we did way more than we should have. At a place like SoFi, I can't afford it. I can't afford to be building so much infrastructure. I think it's great when I look at my web-first peers like Lyft and like Airbnb and Square and some of these that I think have a little bit more of the luxury of building infrastructure and tooling.
And that's great, and I hope they keep doing it because we're really benefiting from their blogs and their learnings and the open source and the startups that they're spinning out that we get to buy from. But I think there are a lot more companies who are in the buy position, like SoFi, where building infrastructure is not our business. Our business is fintech. And so we need to, I think, be more in the business of putting together tools. Yeah. And I think 1 more maybe, this is not a lesson I take away, but it's something that's a change for me from the time when I worked trying to sell BigQuery. Because for a lot of years, my job was trying to sell BigQuery to the biggest companies on the planet and convince them that we had all of their solutions. And very much our model was building a unified platform that everyone could just leverage. And now being on the outside, I don't think that's reality. I think it's much more the case that, at least in the phase of development we're in right now, everybody needs bespoke solutions.
Everybody's making different trade offs. There are things that are right for SoFi or that are necessary for SoFi, choices I have to make, that would not be right for even a Robinhood or a Chime or some of the ones that people might look at and say, well, they're pretty close to SoFi, so they probably have the same requirements. But as I said, being a chartered national bank, we have different requirements than they have, and I'm gonna have to bring in heavier-weight process than would be appropriate for them.
[00:46:42] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder.
[00:47:17] Unknown:
In addition to the work that you're doing at SoFi and the work that you've done at Google and actually tying that to sort of what you've done at Google and some of the ways that that has influenced the trajectory of the ecosystem of data tools, it has opened up the idea of data marketplaces with 1 of the first ones being actually in BigQuery and being able to have public and shareable datasets where I can host a dataset and you can query it, but you're gonna be the 1 paying for the queries so that I don't have to. And that has been copied as a model for Snowflake, and there are actually whole businesses being built up around having shareable datasets to build up this sort of data sharing economy. And I'm wondering if you can share your thoughts on what you view as some of the benefits and potential future of these data marketplaces as the technology grows to support them more natively.
[00:48:09] Unknown:
I am personally really excited about the idea of data marketplaces and the opportunity I think we have to turn the economics of data sort of on its head and better align the incentives of the data generators, like you and me, the human beings who are going around using software, and the data providers, the people who are collecting it and aggregating it and then reselling it, which is also me, being at SoFi; we're collecting a lot of data. Right? We have a lot of data. So we had a system inside Google that could allow data owners to, in real time, at execution time, have a check on the sorts of queries that were being run over their data. Now obviously that had to be automated. It couldn't be a human being sitting there saying yes or no to every query. Though there were times where we had things that were that level of detail, where, say, an ad agency wanted some really specific data and we would have human beings who could review those queries and say yes or no, we will allow you to see this result set.
The reason that was important in my mind is I can have much more confidence in sharing my data with you if we're in the world we're in now. So it's no longer I have to FTP you all of the data I have and then just trust that that will work, right? Those were the initial attempts at a data marketplace. I remember Amazon had 1 and Microsoft had 1. I can't remember who else might have. But it was effectively an FTP site with a credit card reader in front of it. And so I would have important data and I would put it on the site and then you would swipe your credit card and now you've got my data. And that never worked. And the reason it never worked is because if you try to sell data in that model and it's valuable data, you get to sell it 1 time.
And then I sell it to you and it ends up on a torrent site and now nobody else ever pays me for my data again and I've lost all control of my data. And it also really only works for primarily static datasets. If the dataset updates every month, then you gotta pay me every month. It becomes a much less interesting proposition for most people. So now we've solved the problem of saying, I don't have to release my data to you. I can have a view, say, in front of my table and I can allow you to query that view and so I know you're only getting to these fields that I want and only with these particular filters on them.
The next level would be an idea where I can let you run a query, but maybe you run a query and your query would only aggregate over 3 individuals. And I say, that's not enough. You could deanonymize from 3 individuals. That's not gonna work. I need there to be at least a 100 people. And so you send a query and I say, nope, that query can't run. And then there are debates and discussions about how much information I give back to you when the query can't run, because sometimes you can learn things just from sending repeated failed queries, but we can negotiate that.
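A sketch of what that owner-side gate might look like, with an illustrative threshold: an aggregate is only released when the cohort behind it is large enough, otherwise the query is refused.

```python
# Sketch of an owner-side policy check: aggregates over too-small cohorts are
# refused rather than returned. Threshold and names are illustrative only.
MIN_COHORT_SIZE = 100

def run_gated_aggregate(rows, group_by, agg_column):
    """rows: list of dicts, each representing one individual's record."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_by], []).append(row)
    results = {}
    for key, members in groups.items():
        if len(members) < MIN_COHORT_SIZE:
            # Refuse rather than return a value that could deanonymize a small
            # cohort. How much detail to reveal about *why* it was refused is
            # itself a negotiation, as noted above.
            results[key] = "REFUSED: cohort too small"
        else:
            results[key] = sum(r[agg_column] for r in members) / len(members)
    return results

rows = [{"zip": "94105", "income": 90000 + i} for i in range(250)]
rows.append({"zip": "94999", "income": 1_000_000})
print(run_gated_aggregate(rows, group_by="zip", agg_column="income"))
# {'94105': 90124.5, '94999': 'REFUSED: cohort too small'}
```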
So then if you think about extending that to the next level, imagine that individual consumers could determine what they were comfortable with their data being used for in various analyses. And obviously, most people are not gonna have the technical wherewithal to figure this out. You solve that with associations, groups basically who can say these are our principles, this is what we're okay with, this is what we're not okay with. And then I can join these groups and say, okay, I'm okay with my data being used according to the principles of this group. Once we get into a world like that, we can then allow the data owners, the human beings who produced the data, to share in the benefits of having their data resold. And share can come in all sorts of forms. The easiest 1 to think about is monetary. I can make some money by letting my data go out there. Right? My data is just another asset that I own that can be put to work for me, and as it gets used in queries, value can come back to me.
So I think that's the direction we could head with these data marketplaces. There's a little bit more technology work needed, obviously: the ability for these on-the-fly, query-execution-time checks to be injected into the query processing stream, and then the idea of these associations of like minded individuals. For me, that's what excites me about data marketplaces.
[00:52:46] Unknown:
That's definitely a very interesting approach to the overall question of how to be able to bring that sort of monetization factor or the captured value back to the individual user, because there have been debates about that for several years now, and most of the approaches that people share are either impractical or impossible or just make no sense whatsoever. But there's definitely an interesting perspective on it. And then the other question too of being able to say, you know, I'm not gonna answer your query because it's far too pointed. You know, you need to have a more general query that is going to give you some value, but it's not going to give you exact information about this, you know, very small cohort of people that you're trying to target. So that could also help to mitigate some of the issues that we have about things like ad retargeting and the, you know, very detailed dossiers that all of these ad agencies are able to build up about individuals. So it helps to solve a couple of those problems at the same time, and I'll definitely be interested to see how that might manifest in the years to come.
[00:53:44] Unknown:
Yeah. I think it would also probably put our industry on a better footing relative to government regulation. Right now, I think it's a shame that we really only offer 2 options. Right? Either you give me all your data and I do whatever I want with it, or you don't give me any data and you don't use my services. And I think that that's a very hard choice. We've seen the laws that try to come in, like, I forget the name of it, but whatever the law is that makes me click allow all cookies on every single website I visit, every time I visit it. And I think very few of us ever say, well, I'm not gonna use this website because I had to click that button. So really, was that a benefit to anyone, that we made that change? So I think that if we can get into a world where our incentives are aligned with the data generators' incentives, then we will see much more sensible regulation.
[00:54:39] Unknown:
Absolutely. And in your time working in industry and helping to build a lot of the systems that helped to catalyze the broader ecosystem and now being a consumer and integrator of those systems, what are some of the most interesting or innovative or unexpected data processes or data systems that you've encountered or had the privilege to build?
[00:55:01] Unknown:
Well, I think at least one of those for me is this notion of data observability, data testability, whatever you wanna call it. The idea that we're going to automate expectations about data and bring machine learning to kind of watch all of our tables and make sure that we know when things are changing. I think that's an innovative one. The notion of the privacy vault that we talked about, I think that's really innovative, and it's really important that we find a solution to that problem. Obviously, I think a lot of the stuff that we did on Dremel was very innovative at the time. But this industry just moves so fast that I think what we think is innovative today, we just end up taking for granted a few years from now. I go back and recruit at universities a lot.
And over 15, 16 years of doing this, it's just been really interesting to observe how much the world has changed, and how many things that used to seem like impossible blockers when I was there have simply stopped being problems. Like, the network was always going to be a problem for distributed computing, and at some point in the last decade it just became not a problem anymore. We've got more than enough of it. I have a hard time saying, I guess, what the innovative things are, because it's all changing so fast that I will probably leave something out that really was innovative at the time and now I just think is commonplace.
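As a minimal sketch of the automated-expectations idea mentioned a moment ago: the toy check below compares a table's daily row count against a rolling baseline and flags large deviations. The function name, thresholds, and sample numbers are all made up for illustration; real observability tools track many more metrics and learn their expectations automatically.

```python
# Illustrative only: a toy "data observability" check that flags a table whose
# daily row count drifts too far from a recent baseline.

from statistics import mean, stdev

def row_count_alert(history, todays_count, z_threshold=3.0):
    """Return an alert string if today's row count deviates from the baseline.

    history: list of recent daily row counts for the table being watched.
    """
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return None if todays_count == baseline else "Row count changed on a previously constant table."
    z = abs(todays_count - baseline) / spread
    if z > z_threshold:
        return f"Row count {todays_count} is {z:.1f} standard deviations from baseline {baseline:.0f}."
    return None

if __name__ == "__main__":
    recent = [10_120, 10_340, 9_980, 10_210, 10_405, 10_150, 10_290]
    print(row_count_alert(recent, 4_500))   # flags a suspicious drop
    print(row_count_alert(recent, 10_200))  # within expectations -> None
```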
[00:56:32] Unknown:
In terms of your experience of building and supporting these data systems, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:56:40] Unknown:
I think the challenging lessons that I've seen are all people and business problems. On the technical stuff, one of my favorite engineers I worked with at Google had a habit of saying, it's software, we can do literally anything. Anytime someone would say, well, I don't know if we can do that, his rejoinder would be, it's software. We can do absolutely anything we want. We just have to decide whether we want to do it. So the difficult things for me, the complex challenges, are figuring out the structures of organizations, figuring out how we can get people all pushing in the same direction on our data problems, and really making it easier to do the right thing than to do the wrong thing.
Because if you make it hard for engineers to produce reliable analytic data or you make it hard for data scientists to consume from the canonical blessed tables, they'll just do something else. And then all your work on the technology side to try and make things good just goes out the window because the human beings are gonna do what they're incentivized to do.
[00:57:53] Unknown:
Are there any other aspects of your work on Dremel, BigQuery, and SoFi, or any other forward-looking pontifications that you'd like to share that we didn't discuss yet, before we close out the show?
[00:58:05] Unknown:
Yeah. I don't think so. I'm not a real pontificating kind of guy. You need other people for that. I think what I will say is we are at a great time in data analytics. The cost of launching new data-focused businesses is as low as I've ever seen it, with this competition from Snowflake and BigQuery. The number of tools that are out there is so great that I think we're really at a place where there's a big opportunity for some names to be made here, and for us to be talking a decade from now about the people who revolutionized data governance or data cataloging or data quality or whatever it is. I'm just really excited to see where that goes. And I'm kind of excited to be now seeing it from the other side of the sideline, being the person who's using it rather than the person who's having to worry about building it.
[00:59:04] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:59:19] Unknown:
The biggest one we are seeing right now is that privacy vault that we discussed, or some comparable solution, something that makes handling PII correctly easier than handling it incorrectly. That is the one I would love to see people solve.
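As a minimal sketch of the tokenization pattern behind a privacy vault (the class and method names here are hypothetical, not any particular vendor's API): raw PII is swapped for opaque tokens at the point of ingestion, analytic tables only ever store the tokens, and detokenization is confined to a single, policy-gated service.

```python
# Illustrative only: a toy tokenization "vault" that swaps PII for opaque
# tokens so downstream tables never store raw values. Real vaults add access
# control, auditing, and format-preserving tokens; this shows only the shape.

import secrets

class ToyPrivacyVault:
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a stable opaque token for a PII value, minting one if needed."""
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        """Look up the original value; in a real vault this call is gated by policy."""
        return self._token_to_value[token]

if __name__ == "__main__":
    vault = ToyPrivacyVault()
    record = {"email": vault.tokenize("jane@example.com"), "loan_amount": 12000}
    print(record)                             # analytics tables only ever see the token
    print(vault.detokenize(record["email"]))  # a privileged service can recover the PII
```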
[00:59:34] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing over the past decade plus and the work that you're doing now. So I appreciate all the time and energy you've put into helping to grow the community and inspire the surrounding ecosystem and now being a consumer of said ecosystem. So thank you again for all of the time and energy you've put into that, and I hope you enjoy the rest of your day. Alright. Thanks, Tobias. Talk to you later.
[01:00:04] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Dan Delorey: Journey Through the Data Ecosystem
Dan's Current Role and Relationship with the Data Ecosystem
Early Career and Work on the Dremel Project
Impact of Dremel on Data Federation and ELT Paradigm
Challenges and Innovations in Data Joins and Federation
Data Lakehouse and Data Mesh Concepts at SoFi
Modern Data Stack and Data Modeling at SoFi
Data Governance and Ethical Considerations
Privacy Vaults and Tokenization
Data Marketplaces and Future of Data Sharing
Innovative Data Processes and Systems
Challenging Lessons in Data Systems
Conclusion and Final Thoughts