Summary
Dan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products, he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing what your current relationship to the data ecosystem is and the CliffsNotes version of how you ended up there?
- Dremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?
- You were instrumental in crafting the vision behind "querying data in place" (what came to be called federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach?
- How well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper?
- Following your work on Dremel you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. What do you see as the influence that those tools had on the evolution of the broader data ecosystem?
- How have your experiences at Google influenced your approach to platform and organizational design at SoFi?
- What’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house?
- How does your team solve for data quality and governance?
- What are the dominating factors that you consider when deciding on project/product priorities for your team?
- When you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there?
- You also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space?
- What are the most interesting, innovative, or unexpected data systems that you have encountered?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems?
- What are the areas that you are paying the most attention to?
- What interesting predictions do you have for the future of data systems and their applications?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- SoFi
- BigQuery
- Dremel
- Brigham Young University
- Empirical Software Engineering
- Map/Reduce
- Hadoop
- Sawzall
- VLDB Test Of Time Award Paper
- GFS
- Colossus
- Partitioned Hash Join
- Google BigTable
- HBase
- AWS Athena
- Snowflake
- Data Vault
- Star Schema
- Privacy Vault
- Homomorphic Encryption
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode
[00:01:46] Unknown:
today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and an early engineer on the Dremel project. So, Dan, can you start by introducing yourself?
[00:02:11] Unknown:
Yeah. Hi, Tobias. I am the vice president of data at SoFi. Currently, I've been here a little over a year working on building the data platform for SoFi. Prior to that, all my real professional experience was at Google. I worked at Google for 14 years, most of the time on Dremel and then later BigQuery when the teams were combined. So your audience is probably familiar with the Dremel paper. BigQuery is the externalization of Dremel into Google's cloud. Prior to that, I was doing my graduate work at Brigham Young University. And do you remember how you first got started working in the area of data? It was during my PhD program. Of course, back then in the early 2000s, we called it data mining.
And so I was studying empirical software engineering, that was my emphasis, in the area of mining software repositories. So what we were doing was scraping data from all of the artifacts of open source software that we could find, indexing it, and then trying to do interesting analyses of it. When I went to Google, I worked on ads optimization, which was, again, sort of a big data problem. We were trying to build keyword suggestions so that advertisers could get help in building out their advertising campaigns. After 2 years on that, I had the opportunity to join the Dremel team as an engineer.
[00:03:43] Unknown:
From there, it was all big data for me. Can you start by sharing a bit about what your current relationship is to the overall data ecosystem and maybe the CliffsNotes version of how you ended up there, which you gave us a little bit, but maybe a little bit more sort of detail about the different juncture points along the way? Yeah. At this point, I see myself as a consumer of the tools. That's my relationship to the ecosystem. I love all of the
[00:04:08] Unknown:
explosion of tools in the modern data stack. I think that we've come a long way in just the last decade. And so it's really exciting for me to be out here getting to use the tools and see how these things fit together. I think during my time on BigQuery, I had a very deep but somewhat narrow view of the world in terms of seeing query execution and data storage as the primary drivers of the problem. And really, now that I'm on the other side, I see that the exact optimal performance of a given query is not really my primary concern day to day.
The bigger problems are much more finding data, making sure data is reliable, monitoring SLAs about data delivery, answering business user questions. And so I've seen some other pieces of the data stack that I think there's some cool innovation happening, but we still have a lot of work to do.
[00:05:09] Unknown:
And now jumping back to sort of earlier in your career, as you said, you were 1 of the early engineers on the Dremel project, and that was, to my understanding, Google's next generation of data processing after their work on MapReduce and, you know, as Hadoop was starting to become widespread and mainstream in the ecosystem outside of Google, Google had already moved on from that paradigm and started working on Dremel, which has come to be more of the sort of agreed upon better approach. And I'm wondering if you can just share some of your perspective and context on that stage of the data ecosystem and maybe some of your experience of working on Dremel and the concepts that it was encapsulating as compared to the MapReduce paradigm that was starting to grow and be fostered outside of Google.
[00:05:57] Unknown:
Yeah. Absolutely. And, yes, I think you've framed that exactly right. The original engineer, Andrey Gubarev, who started the Dremel project inside Google, was trying to solve the problem of boilerplate code, long startup times, difficulty in chaining steps together, difficulty in writing the programs, all the things that we know about MapReduce, the initial ecosystem. At Google, there was a language called Sawzall, which some folks may have heard of. You could think of it as like parallelized Python that could compile down to MapReduces. And so he was trying to solve the problem of how could I just write a SQL query and how could I have it run really fast.
It was fairly rewarding. In 2020, we got to write the VLDB test of time award paper for Dremel. The original Dremel paper was published in 2010, and so in 2020 it won the test of time award. And when we went back and looked at that and tried to break it down, we broke the contributions that Dremel made to the industry down into sort of 5 categories. The first was bringing SQL back. People who are operating today, or have started since Dremel kind of brought it back, may not realize that in the late 2000s, early 2010s, it was believed that SQL was not a proper language for big data and it wasn't gonna work. And I think now we can see with Snowflake, with Redshift, with Athena, with BigQuery, that everybody's all in on SQL as the right language even for very large datasets.
The second thing for Dremel was disaggregation of compute and storage. The idea that I can scale 1 without scaling the other. And I would put in there along with that the horizontal scaling. The idea that I don't need ever larger computers to be able to handle more data, but I can have just a fleet of computers. A lot of that was following from the MapReduce work, obviously. The idea of in place data analysis that I would query data wherever it happened to reside, for us, that primarily meant in Google's distributed file system, initially GFS and then later Colossus.
And the idea of serverless, that the user running the query shouldn't have to worry about whether the servers were running or starting them up. They should just be there and be able to run the query. And then of course, columnar storage. That was 1 of the big innovations of the Dremel project. The idea that you could do columnar storage but still allow nested repeated data. For us inside Google, that meant protocol buffers; for much of the world now, that means JSON, but it's the same idea. The idea that I could still give you an efficient query over nested repeated data without you having to normalize everything.
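To make the nested columnar idea concrete, here is a minimal Python sketch of shredding nested, repeated records into per-field column vectors, so a query that touches one field only reads that column. It is an illustration only: the actual Dremel encoding described in the paper also stores repetition and definition levels so the nesting can be reconstructed losslessly, which is omitted here.

```python
# Minimal sketch of "column shredding" for nested, repeated records.
# The real Dremel encoding also tracks repetition/definition levels; this
# only shows why a column-at-a-time layout lets a query read a single field.
from collections import defaultdict

def shred(records, columns=None):
    """Flatten nested dicts/lists into {dotted.path: [values]} column vectors."""
    if columns is None:
        columns = defaultdict(list)
    for record in records:
        _shred_one(record, "", columns)
    return columns

def _shred_one(value, path, columns):
    if isinstance(value, dict):
        for key, child in value.items():
            _shred_one(child, f"{path}.{key}" if path else key, columns)
    elif isinstance(value, list):          # repeated field
        for child in value:
            _shred_one(child, path, columns)
    else:                                  # leaf value
        columns[path].append(value)

docs = [
    {"name": "doc1", "links": {"forward": [20, 40]}, "words": ["a", "b"]},
    {"name": "doc2", "links": {"forward": [80]}, "words": ["c"]},
]
cols = shred(docs)
# A query like SELECT COUNT(links.forward) only has to scan one column vector:
print(cols["links.forward"])   # [20, 40, 80]
print(cols["name"])            # ['doc1', 'doc2']
```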
[00:08:49] Unknown:
The other key aspect of Dremel was that it was still a divergence from the sort of then traditional data warehouse paradigm of everything being highly structured, very rigid, but, you know, easy to be able to answer a fixed set of questions and instead being able to support unstructured datasets, querying data where it lives, to some degree, breaking down the concept of data gravity as a blocking factor in what you can do with the information. And I'm wondering if you can just talk to what you see as being the broader impact on the ecosystem of allowing for that sort of data federation working with semi structured and somewhat unstructured datasets and being able to then, you know, allow for this new paradigm of extract load transform that was prior to that point largely intractable.
[00:09:39] Unknown:
Yes. I think that's true. I think there were a couple of things there. 1 thing that we always focused on with Dremel was making the queries fast. It was supposed to be an interactive system. The majority of queries we ran, ran in under a minute, many under 10 seconds. And it really changed the paradigm from MapReduce where I would code up my couple classes, my mapper and reducer, and then I would kick off a Borg job inside Google to start up however many instances of my objects I needed. And then I'd wait for my analysis to run and it might take a couple hours. And so I would go play a game of pool or something and come back later and see the answer. I should say, in the initial versions of Dremel, the first couple years of Dremel, you did have to prepare your data ahead of time and you had to load it on the local disk on the machines and you had to start up the servers. And so it was not great for adoption at that point. But at the point we moved to the Dremel service, where we had a standing tree of servers, because we did aggregation trees at that time. You'll see in more recent papers that we've changed the architecture there. But the idea that you didn't have to do anything, and as long as your data was sitting in the remote file system, you could issue a query and that query would execute immediately and get you the results back so quickly that you didn't even have time to write the next query before the answer came back.
It became really powerful. What I found is as people get access to tools and more data, it really only ever leads to more questions that they want to ask. And so there's always this inflection point. We saw it all the time when we were selling BigQuery into enterprises: there's just this inflection point in data consumption in these large organizations when the users get the ability to ask questions and get answers faster. So I think that's 1 of the big ways that it changed the dynamics.
[00:11:41] Unknown:
I'm not as up to speed with the architecture of Dremel as I am with systems such as Presto, which are, I guess, the spiritual successors at least of what you were building there. But I know that in order to be able to query across these different datasets, you know, whether they're living in S3 or the Hadoop file system or the Google file store, you need to have the metastore to understand, you know, what are the files on disk, what is the schema of those files, so that I can be able to structure the queries and be able to, you know, build the query plan to understand, okay, these are the files that I actually need to fetch, these are the operations I need to perform on them, etcetera. And I'm curious if Dremel has that same architectural component of needing to be able to schematize and sort of maintain that metadata ahead of time before you're actually able to execute those queries, or if you have a different system for being able to propagate that information to the query planner to be able to understand the actual sequence of operations that are necessary to be able to satisfy a given SQL structure?
[00:12:43] Unknown:
So today, the answer is yes, in what Dremel has become, BigQuery. BigQuery does have a metadata store. It does have managed storage, and it does use some of that information in query planning, less than what you would probably expect relative to other systems. But that's because in Dremel, we didn't have anything like that. We never built a metadata store. The files were self describing. And we had this interesting dynamic for a lot of years on the internal system, where much of the data we were querying, we queried exactly 1 time.
Meaning someone issued a query, we read that file for the very first time and the very last time we were ever gonna read that file. And that sets up a really interesting dynamic relative to the creation of this metadata and the computation of statistics that often get used in cost based optimization. If you are going to double the expense, right, from I'm only ever gonna read this file once to now I'm gonna read this file twice, you would have to bring some benefit that essentially completely removed the cost of reading the file at all the second time. Because the first time, when you're computing statistics, you're always gonna have to do a full table scan, right? And so I think that was 1 of Andrey's really clever insights and innovations when he started Dremel, just to say we're gonna make this thing performant without ever having to precompute anything.
And that led to a lot of years of really simple usage for the users when you could just point the tool at your dataset, not have to wait for any pre computation or load or anything and have the query still run at interactive speeds. Now, the way we did that, of course, we didn't have magic. The way we did that was we threw tons of resources at the problem. Right? So you just get a lot more CPUs, you get a lot more network. Really, I think the network was the key to why Dremel succeeded. I know some people have asked why was Dremel so successful inside of Google, but projects like, for example, Drill were not equally successful outside in the market.
I think 1 of the main reasons was the innovation on the networking inside Google. We did everything we could to saturate the network inside the data centers we ran in. 1 of the things that people outside the Dremel team didn't know was that we were never very good citizens in terms of resource usage. But it was okay because everybody was using Dremel. So they were all benefiting from us abusing every loophole we could find.
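As an aside on the "self-describing files" point above, here is a small illustration using Parquet as a stand-in, since Dremel's internal formats are not public (Parquet's nested columnar layout was itself inspired by the Dremel paper): the schema and row counts live in the file footer, so a reader can plan a scan without consulting any external metastore.

```python
# Illustration of a self-describing columnar file: the schema and statistics
# are read from the Parquet footer itself, not from a separate catalog.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 3.2, 7.8]})
pq.write_table(table, "events.parquet")

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)          # schema comes from the file footer
print(pf.metadata.num_rows)     # 3

# Reading a single column is cheap because the data is stored column-wise:
print(pq.read_table("events.parquet", columns=["amount"]))
```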
[00:15:19] Unknown:
The other aspect that Dremel has led to is this idea of data federation where you have this query engine that, as you said, is decoupled from the storage layer, but that also gives you the opportunity to make the query execution pluggable so you're not necessarily just working with files on disk. You might also be, you know, translating your SQL statement into a different statement for a different relational database or a non relational database and then being able to aggregate data across multiple different systems to be able to build analyses across them so that you don't necessarily have to do this extract and load process to be able to actually answer questions. You can just say, I'm gonna query the data where it lives, and I'm wondering what you see as the sort of broad impact that that has had on analytical capabilities, both in terms of what tools like Drill and Dremel and Presto are able to do, but also in terms of the ways that we think about how to build data systems and how to think about the contracts between data producers and owners and the downstream consumers of that data?
[00:16:25] Unknown:
Great question. Lots of stuff to dig into there. So 1 thing I would say is, in addition to just being able to query the federated data, I think the revolution that came about because of us having done that in Dremel, at least inside Google, was that suddenly people expected to be able to join datasets even if the data didn't reside in the same system. So prior to that, things like the initial MapReduce, of course, didn't join anything. You had 1 dataset and you ran it. You did a map and a reduce and you produced an output, and unioning things even was hard. But joining was certainly hard. So in Dremel, because we launched join, and in order to be able to join we had to be able to do it on the fly, we called it shuffle. That's a repartitioning operation so that you can repartition the data to get the right join keys. And, you know, our primary join strategy was always partitioned hash join. We tried a couple of others. We tried lookup join. Certainly, we do have functionality for broadcast join if 1 of the datasets is small, and we did a lot of work to push small up to, like, the size of a gigabyte.
But there were always limitations. So our primary mode was the shuffle hash join. And once you were able to read federated data and you could still do a join with an on the fly partition in memory and that was your core strategy, it opened up the ability to join data from wherever it happened to be. So for us, we used Bigtable, obviously, which HBase is the open source version of. But when you had a Bigtable that you could join to a file in Colossus, that you could join to a query result from a MySQL database or an F1 query or something like that, it really changed people's expectation. And getting back to the idea that data just engenders more questions, it really opened up people's eyes to the possibilities: look at all these interesting analyses I could do that now require no work from me to preprocess the data or do a multi phase orchestration with all sorts of different transformations. I can just express it as a common SQL statement using the join operator I'm used to. And under the hood, Dremel takes care of scheduling all of that, building the query plan.
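For readers who want to see the shape of the shuffle plus partitioned hash join strategy described here, the following is a simplified, single-process sketch. In Dremel or BigQuery the shuffle repartitions rows across many workers over the network; here each partition is just an in-memory list, but the core idea, that rows with equal join keys hash to the same partition and each partition can be joined independently, is the same.

```python
# Simplified, single-process sketch of a shuffle + partitioned hash join.
NUM_PARTITIONS = 4

def shuffle(rows, key):
    """Repartition rows by a hash of the join key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[hash(row[key]) % NUM_PARTITIONS].append(row)
    return partitions

def partitioned_hash_join(left, right, key):
    left_parts = shuffle(left, key)
    right_parts = shuffle(right, key)
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        # Build a hash table on one side of the partition...
        table = {}
        for row in lpart:
            table.setdefault(row[key], []).append(row)
        # ...and probe it with the other side.
        for row in rpart:
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Grace"}]
events = [{"user_id": 1, "event": "login"}, {"user_id": 1, "event": "click"}]
print(partitioned_hash_join(users, events, "user_id"))
```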
So I think that was 1 way in which it changed the paradigm where people were able to expect to join everything. And then it went to the next level when we rolled out BigQuery and now you see the same thing evolving with Snowflake's data marketplace. But the idea that the entire world's data analysis system could be 1 global system where all the data was joinable from its original place at rest. So there was no need to copy or get stale redundant versions of your data anywhere. I could just share my table with you and you would be able to query it.
The second part though of your question, because you asked the question then what did that do to the relationship between data producers and consumers? I think it actually sort of broke the relationship, and it's 1 of the things we're struggling with now at SoFi, figuring out how to rein that back in. And I didn't always see it when I was inside Google. I think in the early days, you could accurately describe what we built with Dremel as a data lake, or at least an early version of a data lake. And then when we started to roll out BigQuery, we started by trying to keep that same data lake paradigm. And we discovered after a few years that it just wasn't working for organizations, and we had to become much more like a data warehouse in BigQuery, moving away from the data lake paradigm.
And I'll explain what I mean by that. In a data lake, I think the governance and the agreements between data producers and data consumers are hard to maintain. It lowers the barrier to entry for me as a consumer to just be able to query any data at rest as long as I can get access to that data. But it also leads to me potentially accidentally taking dependencies on things that the data producer has no interest in guaranteeing. And so schemas can change out from under me, data can disappear, it can be low quality or unreliable data, and I have no way of knowing that. And so I think there is now a move back toward the data warehouse paradigm with things like Snowflake and BigQuery, trying to keep all the advantages of federated data and all of that, but making it clear which data I actually need to be reliable, because it's going into my regulatory filings or it's being shown back to my end users or something like that. For that data, I do think we're going to see a big move back to wanting it to be controlled and monitored and reliable.
All of the things that I used to get from the really strict ETL pipelines owned by the central data team with guarantees about reliability and freshness and all of that.
[00:21:46] Unknown:
With that shift from the sort of data lake paradigm to where we are with data warehouses, and now the up and coming term being the data lakehouse, where it's sort of this hybrid of, I can dump all the data that I want, but it has to be at least semi structured and cleaned before I'm gonna bother querying it and exposing it to other people. I'm wondering what your view is as somebody who has helped to build a lot of the foundational tools and concepts that led to these capabilities. Now being a consumer of them, how does that impact your sort of design sense and how you think about the appropriate way to build data platforms at scale in a way that is sort of flexible and agile, but structured enough to be able to actually have reliable outputs and high quality data?
[00:22:29] Unknown:
That's exactly the right place to go with this. And I would throw 1 more thing into the mix. I know it's a bit of a fraught topic. So I'll just say the data mesh idea, regardless of whether you take the canonical definition or the more loose definition as I prefer, but this idea of distributed data ownership, that's really what we're pursuing at SoFi, and I think a lot of my peers and other organizations are pursuing as well because they're recognizing that trying to staff up a central data engineering team that owns all of the phases of data from production all the way to consumption isn't going to work.
We are, I think, a little bit like most organizations, kind of fumbling in the dark. We're trying to figure out what works for us and how we can guarantee some of the things we need. Being in financial technology and also having just gotten a bank charter, so we're not just, like, playing in fintech, but we're an actual bank regulated by the OCC and other organizations, we have some heavier audit requirements and regulatory requirements than I think we were used to in the past. And so that's leading us to try and find ways to still be able to distribute, but still provide guarantees for certain datasets.
The organizational structure that we've landed on is to have our engineering teams responsible for the production and ingestion of their data all the way into what we call the cleansed area in our data warehouse. We conceptualize our data warehouse in 4 zones. The first is raw. That's where the ingested data lands and there are really no guarantees. You can think of this like the data lake component. It's private by default. The data may be schemaless, meaning they're just dumping JSON blobs into variant types in Snowflake and then cleaning it up as the first phase in their ELT process. But our goal is for all analytic data, or all potentially interesting analytic data, to land in the raw zone in Snowflake.
So we are putting all of the data inside Snowflake in the raw zone. This is similar in principle to what other folks are doing by landing in S3 first and then transforming from there, using either Snowflake's ability to query directly from S3 or using a query engine like Athena, something like that. For us, I wanna shut down the multiple tools, multiple sources of truth problem, and so we are putting it all inside Snowflake, inside private schemas so that people cannot accidentally be getting to each other's raw data. The second zone we call cleansed. Cleansed is where you impose a schema.
You do whatever cleaning is necessary, deduplication, standardization of data types, data content, probably some introduction of synthetic join keys to support the next phase in the process. And the cleansed layer is where we expect there to be a contract between the data producers and the data consumers. Meaning, if I put a field in there, I'm not gonna just pull that field away without some automated testing being able to catch me. For example, we don't give any group direct DDL or DML access to their cleansed schema. The only way you can update the schemas or the data inside your cleansed schema is via some automated process, which requires GitLab and our CI/CD pipelines to run.
And so your consumers always have the ability to detect if you're changing something. The next zone for us on top of cleansed, we call modeled. Pretty self-explanatory, it's where you build your data models. There will be some amount of join and aggregation there. We'll have all sorts of flavors in there, from star or snowflake schemas, fact and dimension tables, to just flat broad tables. And we're still doing it in a distributed way there, where we do have a central core data warehouse, which is the most important piece. But then around it, we have what are called team data marts, where individual groups for the different vertical business units we have can be building their own data models. And then the last zone we have on top of all of that we call summarized.
That's where you build your reporting layer. So those should be the tables that are built for optimizing some report. In general, we think the pipeline will flow where data science can prototype reports directly off the modeled schema. But then if they find the SQL query for the report they want, they'll probably turn that into the equivalent of a materialized view or base tables with aggregates or something that allows them to do very simple reporting. And we'd like to pull most of the business logic out of our business intelligence tool, which is Tableau, and keep it all in the data warehouse.
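As a compact summary of the four zones just described, here is an illustrative sketch. The zone names come from the conversation; the exact flags, and which zones beyond cleansed gate DDL through CI/CD, are assumptions for illustration rather than SoFi's actual configuration.

```python
# Illustrative summary of the 4 warehouse zones described above. The flags are
# assumptions (the conversation only states the CI/CD-gated DDL rule explicitly
# for cleansed), not SoFi's actual configuration.
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    purpose: str
    private_by_default: bool   # raw schemas are private so teams can't read each other's raw data
    contract: bool             # consumers may rely on fields not changing underneath them
    ddl_via_ci_only: bool      # structure changes must flow through GitLab CI/CD

ZONES = [
    Zone("raw", "ingested data lands as-is, e.g. JSON blobs in variant columns",
         private_by_default=True, contract=False, ddl_via_ci_only=False),
    Zone("cleansed", "schema imposed, deduplicated, typed, synthetic join keys added",
         private_by_default=False, contract=True, ddl_via_ci_only=True),
    Zone("modeled", "core warehouse plus team data marts: facts, dimensions, wide tables",
         private_by_default=False, contract=True, ddl_via_ci_only=True),   # assumption
    Zone("summarized", "report-optimized tables and materialized views for BI",
         private_by_default=False, contract=True, ddl_via_ci_only=True),   # assumption
]

for z in ZONES:
    print(f"{z.name:<11} contract={z.contract!s:<5} ddl_via_ci_only={z.ddl_via_ci_only}")
```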
[00:27:49] Unknown:
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With SelectStar's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan.
[00:28:35] Unknown:
The modeling question is something that has been interesting to me for a while now because of the fact that since we do have these very powerful and flexible engines that allow us to be able to query across structured and unstructured datasets, and we have the power to be able to just kind of throw everything into a giant table and be able to run aggregates across it, we don't necessarily have to be as rigorous as we used to be when we were all relying on a, you know, a massive Oracle server and worrying about the licensing there, or a, you know, Microsoft SQL Server, and having to do these, you know, star and snowflake schemas, or, you know, even data vaults. And I'm wondering what you have seen as the impact of these newer generations of technologies on the ways that people think about data modeling and the sort of rigor that they place in terms of how they're actually structuring their schemas versus just saying, okay, well, this is all in a table, I can run a select statement that gets me what I want, so, you know, good enough, I'm off to my next task.
[00:29:31] Unknown:
I think it's been a really good thing in terms of getting more analysis inside the data warehouse. As you say, now that we're not constrained by the resources of running on a single Oracle or Postgres box or whatever, and having to lock people out during certain times so core processes could run, that piece of it has been fantastic. What I'm encountering since I came over to SoFi, which actually never occurred to me when I was building the tools at Google, and it probably should've but it just didn't click until I was the 1 doing it, is that there were other advantages to the old model. We looked just at the cost and we said, well, it was about resource constraints, and now I don't have the resource constraint, so now I don't have to do any of that other stuff anymore.
1 of the problems with that: take our model, for example. If we were to tell everybody we only have 2 zones, we have raw and cleansed, then because we use Snowflake and Snowflake is so fast and we have so many resources, everybody's just gonna query directly against the cleansed data, which is still largely in a normalized schema. And then you're gonna produce your reports directly from there. What you get then is a proliferation of business logic into everybody's report and dashboard. And the bigger your organization gets and the more distributed, the higher the likelihood that there are gonna be disagreements.
And we see this in spades at SoFi right now. Executives getting 3 different reports that all claim to be talking about the same thing but showing different numbers. Because either they've used different filters and so their date ranges are slightly different, or their criteria for the cohorts that they're including are slightly different, or they've joined in a different order, or they've used a left outer join instead of a right outer join, or they did an inner join where the other guy did a right outer join. And so we've got all these different layers at which the reliability of the data can break down.
And so I think right now what we're seeing is actually a pushback in the other direction. We went in 1 direction because we now had all this new flexibility, because we had Snowflake and we had these tools that could do things really quickly. And now what we're seeing is we need to rein that in, and we need to find a way to keep the goodness of being able to run a ton of analysis and not having data engineering be the bottleneck of the organization, but also having some sort of gatekeeping and guardrails that prevent people from inadvertently making mistakes.
[00:32:14] Unknown:
Well, clearly, the solution is that we just build a new category of tools and say that that's gonna solve all of our problems. Right? That's what the metric layer is for. That's right. Yes. The metrics store is definitely what we're looking at, and I would throw 1 other in there, which is the data quality tools.
[00:32:28] Unknown:
That's for us sort of the most recent piece we've brought into our system. And I think that getting more rigorous about testing and monitoring data quality, so that our end consumers are not our canaries in the coal mine, is very much akin to the transformation we went through as an industry over the last couple decades toward expecting engineers to always write unit tests on everything they were running and expecting those tests to be run. I don't know if other companies are quite the same as Google. I know mostly about Google, but at Google, the system we have runs every affected test at every single changelist. And so you always know if you broke something, and it does it before you submit. You're not even allowed to submit code that would break tests. I would love to see us at SoFi get to a point with these data observability tools where people were not unknowingly pushing schema changes that broke their downstream consumers or changing data values that invalidated some report.
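To show what "unit tests for data" can look like in practice, here is a minimal, generic sketch of declarative expectations run against a table before a pipeline publishes it. The checks, thresholds, and table names are illustrative and not tied to any particular data observability vendor or to SoFi's setup.

```python
# Minimal sketch of data quality checks run in CI/CD so that downstream
# consumers are not the canaries. Table, column names, and thresholds are
# illustrative only.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany("INSERT INTO loans VALUES (?, ?, ?)",
                 [(1, 1000.0, datetime.utcnow().isoformat()),
                  (2, 2500.0, datetime.utcnow().isoformat())])

def expect_min_row_count(conn, table, minimum):
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n >= minimum, f"{table}: {n} rows (expected >= {minimum})"

def expect_no_nulls(conn, table, column):
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()
    return n == 0, f"{table}.{column}: {n} null values"

def expect_fresh(conn, table, ts_column, max_age):
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    age = datetime.utcnow() - datetime.fromisoformat(latest)
    return age <= max_age, f"{table}: newest row is {age} old"

checks = [
    expect_min_row_count(conn, "loans", 1),
    expect_no_nulls(conn, "loans", "amount"),
    expect_fresh(conn, "loans", "loaded_at", timedelta(hours=24)),
]
failures = [msg for ok, msg in checks if not ok]
if failures:
    raise SystemExit("Data quality checks failed:\n" + "\n".join(failures))
print("All data quality checks passed")
```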
[00:33:39] Unknown:
Jumping back a little bit to the question of data federation, I'm also interested in your take on how you see that impacting the sort of questions of data governance and the sort of formulation and application of ethical priorities on how that data is being used and consumed and remixed, and some of the ways that you're able to guard against or account for bias in the ways that you're building your analyses because of the fact that more people have more access to more data?
[00:34:10] Unknown:
I think there's a lot in there. So federated data, I think that has been 1 of the downsides of it, that you lose some ability to govern. In some senses, that's okay because not all data has to be so tightly governed and controlled. You want there to be flexibility and experimentation. But I think there are places where it runs into problems. I think it's particularly hard to govern and control PII in a data lake environment, particularly if that PII is sometimes stored in schemaless or schema-on-read fields. So I'm saying here, for example, I have seen situations where PII is being stored inside JSON blobs, inside files written to S3.
That becomes a real danger for governance and for compliance, because if it's stored at least in a structured environment, what I can do is read the first few rows of a dataset and identify it looks like the values in this field contain a social security number or address or something. But if I've got JSON blobs, which do not have strictly defined schemas, I can't read just part of that data. I have to read all of that data and I have to parse every JSON blob and look through every field to say, do any of these look like PII? That becomes a cost-prohibitive expense, I think, for many organizations.
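The cost asymmetry described here can be sketched in a few lines: typed columns can be sampled a handful of values at a time, while free-form JSON blobs must be fully parsed and walked. The SSN regex and field names below are illustrative only.

```python
# Sketch of PII detection cost: sampling typed columns versus walking every
# field of every JSON blob. Patterns and field names are illustrative.
import json
import re

SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def flag_pii_in_columns(columns, sample_size=5):
    """columns: {name: [values]}; inspect only a small sample per column."""
    flagged = set()
    for name, values in columns.items():
        if any(isinstance(v, str) and SSN_RE.match(v) for v in values[:sample_size]):
            flagged.add(name)
    return flagged

def flag_pii_in_json_blobs(blobs):
    """blobs: iterable of JSON strings; must parse and walk every record."""
    flagged = set()
    for blob in blobs:
        stack = [("", json.loads(blob))]
        while stack:
            path, node = stack.pop()
            if isinstance(node, dict):
                stack.extend((f"{path}.{k}" if path else k, v) for k, v in node.items())
            elif isinstance(node, list):
                stack.extend((path, v) for v in node)
            elif isinstance(node, str) and SSN_RE.match(node):
                flagged.add(path)
    return flagged

print(flag_pii_in_columns({"ssn": ["123-45-6789", "987-65-4321"], "city": ["SF"]}))
print(flag_pii_in_json_blobs(['{"applicant": {"ssn": "123-45-6789"}}']))
```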
So, you asked about ethical considerations. 1 of the important things we have to think about is, as we take advantage of these new tools that have huge scale and the ability to give us greater flexibility, we have to think about whether that flexibility is being applied to data which is safe or data which could potentially leak. I'll throw 1 more on that if I can. It's a little divergence from the question you asked, but getting back to this idea of joins, I think that has been the real innovation in the last decade since Dremel. The idea that I can join data from all sorts of different datasets.
And I think historically, when people think about data breaches or misuses of their data, they think primarily about leaks, about exposing. This is like the Equifax thing. Right? Someone's gonna get a hold of my 140,000,000 records, and then they'll know all the data that I have. But I think what we're seeing now is the actual bigger danger is not that they get the data that you had, but that they get your data and they join it to some other dataset that you hadn't even considered. We've seen leaks like this, for example, right? The idea that American military personnel wearing fitness trackers that were accidentally still hooked up to Strava, and continuing to do their exercise regimen even though they're deployed to secret bases around the world, suddenly draw a map on a publicly accessible map that says, hey, there is something going on in this region of pick your favorite foreign country that you don't want adversaries knowing about military bases in. Or the New York taxi data that was released, and then somebody found, hey, I can join this data with paparazzi photos and I can find out where famous people are taking taxi rides to and from and where they tend to be at a given date and time.
So I think we need to get in the mindset not just of saying I secure my own data but what is the absolute worst thing someone could do if they had my data and random dataset x that they could join it to? Yeah. And that's giving rise to another sort of subindustry
[00:38:07] Unknown:
of this idea of privacy engineering, where before you actually expose or share any dataset, whether internally or externally, you actually go through and obfuscate the data or, you know, mask it or, you know, add in some random noise to make sure that you can guard, with reasonable expectation, against these re-identification attacks or, you know, these data joining attacks where you say, you know, this 1 piece of information is innocuous in isolation, but when I join that with, you know, US census data, actually, I can see exactly who this person is.
[00:38:42] Unknown:
Yep. Yep. And there's another thing. So you mentioned the metrics store. That's definitely a piece of technology we're looking at for SoFi's data platform and actively trying to figure out what we're doing there. Another 1 is this idea of a privacy vault that some people are talking about. And I think there are some really innovative approaches there. What we're thinking about is we call it shifting left. We would like to be able to shift the obfuscation of PII as close to the point of production as possible. And the way you could do that is, I think 10 years ago, the hope would have been for homomorphic encryption.
We tried; we implemented a couple of homomorphic operators in Dremel and in BigQuery, but it just hasn't panned out yet. I'm not smart enough on the math to know if it ever will. I don't think that fundamentally that's gonna be our saving grace. And we may put a note in the show notes if folks are not familiar with the idea of homomorphic encryption. But essentially it is, I can encrypt 2 values and then I can do operations on the encrypted values, which give me the encrypted version of the result I would have gotten from doing those operations on the unencrypted values. So say I take 2 numbers, I encrypt them, I add them together. The answer is the encrypted version of the sum of those 2 numbers. But there's no way for me to back out at any point from the encryption and get back to the original values.
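For readers who want to see the property Dan describes in action, here is a toy demonstration using textbook Paillier encryption with deliberately tiny, insecure parameters. It is an illustration of the idea only, not what Dremel or BigQuery implemented: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, and only the private key holder can decrypt.

```python
# Toy additively homomorphic encryption (textbook Paillier, tiny insecure keys).
# Requires Python 3.9+ for math.lcm and pow(x, -1, n).
import math
import random

p, q = 1789, 2003                     # toy primes; real keys are thousands of bits
n, n_sq, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

a, b = 1200, 345
ca, cb = encrypt(a), encrypt(b)
c_sum = (ca * cb) % n_sq              # multiplying ciphertexts adds the plaintexts
assert decrypt(c_sum) == a + b
print(decrypt(c_sum))                 # 1545, recoverable only with the private key
```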
So if homomorphic encryption can't save us from this, I think the next best thing is tokenization where I can get a deterministic token for my value. I can store that in a secure place and only pass the token around within my ecosystem. And if the tokens are deterministic, I can at least do equality comparisons between them. So this now gives me join and aggregation without having to decrypt. And if I need to do other operations, I can have differential privacy in the vault at the point of access. So now I don't have to worry about who gets access to my token. I can hand the token out willy nilly and when they go try to exchange it with the privacy vault, the privacy vault can be the 1 that decides do you get all the data? Do you get masked data? Do you get no data? Do you get access just to the privacy vault doing comparisons for you? This would be the model where I say, I know a token.
I know another thing. I want you to tell me if these 2 are the same thing, and it can give you yes or no answers, but it can't give you back the original data.
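A minimal sketch of the deterministic tokenization idea, with hypothetical class and method names: the vault holds the secret key and the token-to-value mapping, everything outside the vault sees only tokens, and because tokenization is deterministic, equality comparisons and joins still work on the tokens themselves.

```python
# Minimal privacy-vault sketch: deterministic tokens preserve equality (so
# joins work) while access policy is enforced at the vault, not by whoever
# happens to hold a token. Names are hypothetical, for illustration only.
import hmac
import hashlib

class PrivacyVault:
    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._store = {}                      # token -> original value, vault-only

    def tokenize(self, value: str) -> str:
        token = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()
        self._store[token] = value
        return token

    def detokenize(self, token: str, caller_is_authorized: bool) -> str:
        # The vault decides who sees raw PII, masked data, or nothing.
        if not caller_is_authorized:
            raise PermissionError("caller may not see raw PII")
        return self._store[token]

vault = PrivacyVault(secret_key=b"vault-secret")

# Two systems tokenize the same SSN independently and get the same token,
# so their datasets can be joined on the token without exposing the value.
t1 = vault.tokenize("123-45-6789")
t2 = vault.tokenize("123-45-6789")
assert t1 == t2
print(vault.detokenize(t1, caller_is_authorized=True))
```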
[00:41:21] Unknown:
Yeah. I've been reading about that a little bit recently myself as well. I've had somebody contact me with the potential of being on the show to talk about that idea as well, so definitely spending some focus there. And so now digging a bit more into your experience building at SoFi, you've mentioned a bit about some of the stack that you've built up, some of the things that you're looking towards. And I'm wondering if you can just talk to how your experiences working at Google on building Dremel, helping to grow BigQuery, building some of the broader ecosystem of data analysis and data tooling for the Google Cloud Platform, how that has influenced the way that you think about platform design, organizational structure, and sort of the prioritization of features and projects and the valuation of data at SoFi, and some of the ways that that earlier experience has led into where you are right now and the ways that you think about things. Yeah. There's 1 more thing about the time at Google that I would add that I think has probably had the biggest impact on how I'm thinking about things at SoFi.
[00:42:27] Unknown:
So shortly after we launched the Dremel service back in 2010, we began on a project which we eventually named Plex inside Google. It was p l x because, I guess, vowels are expensive or something. Folks may have read about the Plex system; there are some papers we published a little bit externally about it. But the idea was to build a unified data analytics platform. And even back then, we had identified a number of components like a data catalog, like data observability tools, a robust ad hoc query UI, dashboarding tools, all the things that we would need to build. And of course, being Google and being that it was early in the evolution of these things, we built everything ourselves.
And some pieces succeeded more than others, and some took a long time to get to various points. I would say Plex is now very successful inside Google. And the reason I bring that up, I think I have taken 2 lessons away from that experience that have shaped the way I'm thinking about it at SoFi. 1 is I think we did a really good job of identifying the important components on the Plex project. And when I came over to SoFi and as we have been building, it has very much shaped my view of what's missing. And so as I've come in, I've identified gaps along the way. When I first got here, there were 4 specific things I identified, gaps in our offering that we needed to fill in. We've since added a couple more. But that idea of thinking about it in terms of component systems and how do they fit together, and how do I build a complete solution out of these component systems?
I think that was 1 thing. But then on the other side, the negative example I took away from our Plex experience in shaping the way we approach things at SoFi is it was extraordinarily expensive to try and build all that stuff ourselves. Now, if you're Google, you can afford that. I mean, at Google, you can even afford to build multiple redundant systems and compete with yourself in the internal market, which we did way more than we should have. At a place like SoFi, I can't afford it. I can't afford to be building so much infrastructure. I think it's great when I look at my web-first peers like Lyft and like Airbnb and Square and some of these that I think have a little bit more of the luxury of building infrastructure and tooling.
And that's great, and I hope they keep doing it because we're really benefiting from their blogs and their learnings and the open source and the startups that they're spinning out that we get to buy from. But I think there are a lot more companies who are in the buy position, like SoFi, where building infrastructure is not our business. Our business is fintech. And so we need to, I think, be more in the business of putting together tools. Yeah. And I think 1 more maybe, this is not a lesson I take away, but it's something that's a change for me from the time when I worked trying to sell BigQuery. Because for a lot of years, my job was trying to sell BigQuery to the biggest companies on the planet and convince them that we had all of their solutions. And very much our model was building a unified platform that everyone could just leverage. And now being on the outside, I don't think that's reality. I think it's much more the case that, at least in the phase of development we're in right now, everybody needs bespoke solutions.
Everybody's making different trade offs. There are things that are right for SoFi or that are necessary for SoFi, choices I have to make, that would not be right for even a Robinhood or a Chime or some of the ones that people might look at and say, well, they're pretty close to SoFi, so they probably have the same requirements. But as I said, being a chartered national bank, we have different requirements than they have, and I'm gonna have to bring in heavier-weight process than would be appropriate for them.
[00:46:42] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder.
[00:47:17] Unknown:
In addition to the work that you're doing at SoFi and the work that you've done at Google and actually tying that to sort of what you've done at Google and some of the ways that that has influenced the trajectory of the ecosystem of data tools, it has opened up the idea of data marketplaces with 1 of the first ones being actually in BigQuery and being able to have public and shareable datasets where I can host a dataset and you can query it, but you're gonna be the 1 paying for the queries so that I don't have to. And that has been copied as a model for Snowflake, and there are actually whole businesses being built up around having shareable datasets to build up this sort of data sharing economy. And I'm wondering if you can share your thoughts on what you view as some of the benefits and potential future of these data marketplaces as the technology grows to support them more natively.
[00:48:09] Unknown:
I am personally really excited about the idea of data marketplaces and the opportunity I think we have to turn the economics of data sort of on its head and better align the incentives of the data generators, like you and me, the human beings who are going around using software, and the data providers, the people who are collecting it and aggregating it and then reselling it, which is also me, being at SoFi; we're collecting a lot of data. Right? We have a lot of data. So we had a system inside Google that could allow data owners to, in real time, at execution time, have a check on the sorts of queries that were being run over their data. Now obviously that had to be automated. It couldn't be a human being sitting there saying yes or no to every query. Though there were times where we had things that were that level of detail, where, say, an ad agency wanted some really specific data and we would have human beings who could review those queries and say yes or no, we will allow you to see this result set.
The reason that was important in my mind is I can have much more confidence in sharing my data with you if we're in the world we're in now. So it's no longer I have to FTP you all of the data I have and then just trust that that will work, right? Those were the initial attempts at a data marketplace. I remember Amazon had 1 and Microsoft had 1. I can't remember who else might have. But it was effectively an FTP site with a credit card reader in front of it. And so I would have important data and I would put it on the site and then you would swipe your credit card and now you've got my data. And that never worked. And the reason it never worked is because if you try to sell data in that model and it's valuable data, you get to sell it 1 time.
And then I sell it to you and it ends up on a torrent site and now nobody else ever pays me for my data again and I've lost all control of my data. And it also really only works for primarily static datasets. If the dataset updates every month, then you gotta pay me every month. It becomes a much less interesting proposition for most people. So now we've solved the problem of saying, I don't have to release my data to you. I can have a view, say, in front of my table and I can allow you to query that view and so I know you're only getting to these fields that I want and only with these particular filters on them.
The next level would be an idea where I can let you run a query, but maybe you run a query and your query would only aggregate over 3 individuals. And I say, that's not enough. You could deanonymize from 3 individuals. That's not gonna work. I need there to be at least a 100 people. And so you send a query and I say, nope, that query can't run. And then there are debates and discussions about how much information I give back to you when the query can't run, because sometimes you can learn things just from sending repeated failed queries, but we can negotiate that.
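A sketch of what that owner-side gate might look like, with an illustrative threshold: an aggregate is only released when the cohort behind it is large enough, otherwise the query is refused.

```python
# Sketch of an owner-side policy check: aggregates over too-small cohorts are
# refused rather than returned. Threshold and names are illustrative only.
MIN_COHORT_SIZE = 100

def run_gated_aggregate(rows, group_by, agg_column):
    """rows: list of dicts, each representing one individual's record."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_by], []).append(row)
    results = {}
    for key, members in groups.items():
        if len(members) < MIN_COHORT_SIZE:
            # Refuse rather than return a value that could deanonymize a small
            # cohort. How much detail to reveal about *why* it was refused is
            # itself a negotiation, as noted above.
            results[key] = "REFUSED: cohort too small"
        else:
            results[key] = sum(r[agg_column] for r in members) / len(members)
    return results

rows = [{"zip": "94105", "income": 90000 + i} for i in range(250)]
rows.append({"zip": "94999", "income": 1_000_000})
print(run_gated_aggregate(rows, group_by="zip", agg_column="income"))
# {'94105': 90124.5, '94999': 'REFUSED: cohort too small'}
```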
So then if you think about extending that to the next level, imagine that individual consumers could determine what they were comfortable with their data being used for in various analyses. And obviously, most people are not gonna have the technical wherewithal to figure this out. You solve that with associations, groups basically who can say these are our principles, this is what we're okay with, this is what we're not okay with. And then I can join these groups and say, okay, I'm okay with my data being used according to the principles of this group. Once we get into a world like that, we can then allow the data owners, the human beings who produced the data, to share in the benefits of having their data resold. And share can come in all sorts of forms. The easiest 1 to think about is monetary. I can make some money by letting my data go out there. Right? My data is just another asset that I own that can be put to work for me, and as it gets used in queries, value can come back to me.
So I think that's the direction we could head with these data marketplaces. There's a little bit more technology work needed, obviously: the ability for these on-the-fly, query-execution-time checks to be injected into the query processing stream, and then the idea of these associations of like minded individuals. For me, that's what excites me about data marketplaces.
[00:52:46] Unknown:
That's definitely a very interesting approach to the overall question of how to be able to bring that sort of monetization factor or the captured value back to the individual user, because there have been debates about that for several years now, and most of the approaches that people share are either impractical or impossible or just make no sense whatsoever. But there's definitely an interesting perspective on it. And then the other question too of being able to say, you know, I'm not gonna answer your query because it's far too pointed. You know, you need to have a more general query that is going to give you some value, but it's not going to give you exact information about this, you know, very small cohort of people that you're trying to target. So that could also help to mitigate some of the issues that we have about things like ad retargeting and the, you know, very detailed dossiers that all of these ad agencies are able to build up about individuals. So it helps to solve a couple of those problems at the same time, and I'll definitely be interested to see how that might manifest in the years to come.
[00:53:44] Unknown:
Yeah. I think it would also probably put our industry on a better footing relative to government regulation. Right now, I think it's a shame that we really only offer 2 options. Right? Either you give me all your data and I do whatever I want with it, or you don't give me any data and you don't use my services. And I think that that's a very hard choice. We've seen the laws that try to come in, like, I forget the name of it, but whatever the law is that makes me click allow all cookies on every single website I visit, every time I visit it. And I think very few of us ever say, well, I'm not gonna use this website because I had to click that button. So really, was that a benefit to anyone, that we made that change? So I think that if we can get into a world where our incentives are aligned with the data generators' incentives, then we will see much more sensible regulation.
[00:54:39] Unknown:
Absolutely. And in your time working in industry and helping to build a lot of the systems that helped to catalyze the broader ecosystem and now being a consumer and integrator of those systems, what are some of the most interesting or innovative or unexpected data processes or data systems that you've encountered or had the privilege to build?
[00:55:01] Unknown:
Well, I think at least one of those for me is this notion of data observability, data testability, whatever you wanna call it. The idea that we're going to automate expectations about data and bring machine learning to kind of watch all of our tables and make sure that we know when things are changing. I think that's an innovative one. The notion of the privacy vault that we talked about, I think that's really innovative, and it's really important that we find a solution to that problem. Obviously, I think a lot of the stuff that we did on Dremel was very innovative at the time. But this industry just moves so fast that I think what we think is innovative today, we just end up taking for granted a few years from now. I go back and recruit at universities a lot.
And over 15, 16 years of doing this, it's just been really interesting to observe how much the world has changed, and how many things that used to seem like impossible blockers when I was there have simply stopped being problems. Like, the network was always going to be a problem for distributed computing, and at some point in the last decade it just became not a problem anymore. We've got more than enough of it. I have a hard time saying, I guess, what the innovative things are, because it's all changing so fast that I will probably leave something out that really was innovative at the time and now I just think is commonplace.
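As a minimal sketch of the automated-expectations idea mentioned a moment ago: the toy check below compares a table's daily row count against a rolling baseline and flags large deviations. The function name, thresholds, and sample numbers are all made up for illustration; real observability tools track many more metrics and learn their expectations automatically.

```python
# Illustrative only: a toy "data observability" check that flags a table whose
# daily row count drifts too far from a recent baseline.

from statistics import mean, stdev

def row_count_alert(history, todays_count, z_threshold=3.0):
    """Return an alert string if today's row count deviates from the baseline.

    history: list of recent daily row counts for the table being watched.
    """
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return None if todays_count == baseline else "Row count changed on a previously constant table."
    z = abs(todays_count - baseline) / spread
    if z > z_threshold:
        return f"Row count {todays_count} is {z:.1f} standard deviations from baseline {baseline:.0f}."
    return None

if __name__ == "__main__":
    recent = [10_120, 10_340, 9_980, 10_210, 10_405, 10_150, 10_290]
    print(row_count_alert(recent, 4_500))   # flags a suspicious drop
    print(row_count_alert(recent, 10_200))  # within expectations -> None
```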
[00:56:32] Unknown:
In terms of your experience of building and supporting these data systems, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:56:40] Unknown:
I think the challenging lessons that I've seen are all people and business problems. On the technical stuff, one of my favorite engineers I worked with at Google had a habit of saying, it's software, we can do literally anything. Anytime someone would say, well, I don't know if we can do that, his rejoinder would be, it's software. We can do absolutely anything we want. We just have to decide whether we want to do it. So the difficult things for me, the complex challenges, are figuring out the structures of organizations, figuring out how we can get people all pushing in the same direction on our data problems, and really making it easier to do the right thing than to do the wrong thing.
Because if you make it hard for engineers to produce reliable analytic data or you make it hard for data scientists to consume from the canonical blessed tables, they'll just do something else. And then all your work on the technology side to try and make things good just goes out the window because the human beings are gonna do what they're incentivized to do.
[00:57:53] Unknown:
Are there any other aspects of your work on Dremel, BigQuery, and SoFi, or any other forward-looking pontifications that you'd like to share that we didn't discuss yet, before we close out the show?
[00:58:05] Unknown:
Yeah. I don't think so. I'm not a real pontificating kind of guy. You need other people for that. I think what I will say is we are at a great time in data analytics. The cost of launching new data-focused businesses is as low as I've ever seen it, with this competition from Snowflake and BigQuery. The number of tools that are out there is so great that I think we're really at a place where there's a big opportunity for some names to be made here, and for us to be talking a decade from now about the people who revolutionized data governance or data cataloging or data quality or whatever it is. I'm just really excited to see where that goes. And I'm kind of excited to be now seeing it from the other side of the sideline, being the person who's using it rather than the person who's having to worry about building it.
[00:59:04] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:59:19] Unknown:
The biggest one we are seeing right now is that privacy vault that we discussed, or some comparable solution, something that makes handling PII correctly easier than handling it incorrectly. That is the one I would love to see people solve.
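As a minimal sketch of the tokenization pattern behind a privacy vault (the class and method names here are hypothetical, not any particular vendor's API): raw PII is swapped for opaque tokens at the point of ingestion, analytic tables only ever store the tokens, and detokenization is confined to a single, policy-gated service.

```python
# Illustrative only: a toy tokenization "vault" that swaps PII for opaque
# tokens so downstream tables never store raw values. Real vaults add access
# control, auditing, and format-preserving tokens; this shows only the shape.

import secrets

class ToyPrivacyVault:
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a stable opaque token for a PII value, minting one if needed."""
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        """Look up the original value; in a real vault this call is gated by policy."""
        return self._token_to_value[token]

if __name__ == "__main__":
    vault = ToyPrivacyVault()
    record = {"email": vault.tokenize("jane@example.com"), "loan_amount": 12000}
    print(record)                             # analytics tables only ever see the token
    print(vault.detokenize(record["email"]))  # a privileged service can recover the PII
```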
[00:59:34] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing over the past decade plus and the work that you're doing now. So I appreciate all the time and energy you've put into helping to grow the community and inspire the surrounding ecosystem and now being a consumer of said ecosystem. So thank you again for all of the time and energy you've put into that, and I hope you enjoy the rest of your day. Alright. Thanks, Tobias. Talk to you later.
[01:00:04] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Dan Delorey: Journey Through the Data Ecosystem
Dan's Current Role and Relationship with the Data Ecosystem
Early Career and Work on the Dremel Project
Impact of Dremel on Data Federation and ELT Paradigm
Challenges and Innovations in Data Joins and Federation
Data Lakehouse and Data Mesh Concepts at SoFi
Modern Data Stack and Data Modeling at SoFi
Data Governance and Ethical Considerations
Privacy Vaults and Tokenization
Data Marketplaces and Future of Data Sharing
Innovative Data Processes and Systems
Challenging Lessons in Data Systems
Conclusion and Final Thoughts