Escaping Analysis Paralysis For Your Data Platform With Data Virtualization - Episode 107

Summary

With the constant evolution of technology for data management, it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and the value of increasing the efficiency of your data team.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.



Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage the underlying infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.


What happens when your expanding log and event data threatens to topple your Elasticsearch strategy? Whether you’re running your own ELK Stack or leveraging an Elasticsearch-based service, unexpected costs and data retention limits quickly mount. Now try CHAOSSEARCH. Run your entire logging infrastructure on your AWS S3. Never move your data. Fully managed service. Half the cost of Elasticsearch. Check out this short video overview of CHAOSSEARCH today! Forget Elasticsearch! Try CHAOSSEARCH – search analytics on your AWS S3.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
  • Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform for data virtualization and a universal semantic layer

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the AtScale platform and how it fits in the ecosystem of data tools?
  • What was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success?
  • How is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?
    • How has the surrounding data ecosystem changed since AtScale was founded?
    • How are current industry trends influencing your product focus?
  • Can you talk through the workflow for someone implementing AtScale?
  • What are some of the main use cases that benefit from data virtualization capabilities?
    • How does it influence the relevancy of data warehouses or data lakes?
  • What are some of the types of tools or patterns that AtScale replaces in a data platform?
  • What are some of the most interesting or unexpected ways that you have seen AtScale used?
  • What have been some of the most challenging aspects of building and growing the platform?
  • When is AtScale the wrong choice?
  • What do you have planned for the future of the platform and business?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode — that's L-I-N-O-D-E — today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Matthew Baird about AtScale, a platform for data virtualization and a universal semantic layer. So Matt, can you start by introducing yourself?
Matthew Baird
0:03:00
Hi, thanks, Tobias. I'm Matthew Baird, I'm one of the co-founders of AtScale, and I have a long history in enterprise software and data and analytics.
Tobias Macey
0:03:11
And do you remember how you first got involved in the area of data management?
Matthew Baird
0:03:14
You know, it goes back a long way. By training, I'm a math and statistics guy — grew up and went to school in Waterloo, Ontario, Canada — and I got into enterprise. I don't think anybody grows up and is like, I don't want to be an astronaut, no, I want to be a data analytics guy. But I ended up in enterprise, and data is really important there. In my first programming job I was around when the internet was first really being used to build web applications, and I saw the value of building a web application server that was more data-centric, about pulling data from databases and letting people manipulate the data to do analytics. And we delivered that as a product very early — I think in '95, actually — and it was excellent, because I got to work with all the guys that were, you know, sort of pioneers at that level, exchanging emails with guys like Marc Andreessen and the team at Netscape as they were building out that level of infrastructure that led to what I would call the original data companies, which were companies like PeopleSoft and Siebel and Oracle. And I had a career that spanned multiple visits to those companies, usually through acquisitions of startups that I either started or worked at from early on. I think deriving intellect and using data to figure out human behavior is something that really drives me personally; I'm very interested in that. It's kind of the basis of — when you look at machine learning and statistics and all that stuff, it all leads to this: how will people behave when given this data? And I worked at companies in the startup space that did things like sales incentive compensation management, which is really about using data to drive the behavior of salespeople — and they're really great as a microcosm of behavior because, as we used to say, they're kind of coin-operated; it's very pure. And then I did a bunch of marketing analytics companies that were more thought of as, you know, the big data companies, like Conductor, which got sold to WeWork, and Yield Software, which got sold to HP. I took a little left turn into consumer with Ticketfly and Inflection. And then at the end of that I said, you know what, I've been building this platform at all these companies and sort of doing a good job, but not productizing it — I should go start a company. And that's when AtScale took off. And so, as you said, you've been having to replicate the same functionality in a lot of different places. So I'm wondering if you can start by describing a bit about the types of functionality that AtScale enables, and a bit about the platform itself and how it fits into the current ecosystem of data tools? Absolutely. So I come at this from more of an application builder background. I was a VP/CTO type of guy working at startups, mostly, and a lot of those challenges are: let's collect a bunch of data, or let people input data, or whatever data comes into the system, and then I'm going to do some sort of post-processing transformation on that data, and I'm going to sell you back either intellect or the ability for you to derive intellect from that data. That's very common. You know, Conductor is a great example: we crawled the web, we collected a bunch of data, we let you enter some extra data from your enterprise.
And then we turned around and we gave you a user experience that was all about: let me look at how people are finding my marketing content, and then slice and dice it by machine type, by browser type, by geo — these are all business intelligence analytics questions. And when we built those tools, typically what we'd do back in the day, either with Hadoop or even pre-Hadoop, is you'd drop the data somewhere, you'd post-process it, put it into a relational database so that you could actually query it at a speed that was acceptable for building these user experiences on top of, and then you'd manage that whole pipeline — and it was kind of a big pain in the ass, because you never could just drop the data in a source system that could also serve it. And along comes Hadoop, and Hadoop could kind of do this, or there was some hint that maybe Hadoop could do this. So we started to look at that technology, and what we decided we would build would be, initially, a big data analytics or BI solution on top of Hadoop. It was that easy. We wanted to replicate the functionality of best-of-breed tools like Microsoft SQL Server Analysis Services, or any of the other tools that you see the enterprise using, but we would be able to eliminate, or greatly push down — some of this is in retrospect, to be honest, Tobias — eliminate the painful IT data engineering that had to happen to make that analytics possible. So we delivered that initially, and it's evolved over time, because we started in Hadoop, and the challenge of getting performance on Hadoop is really one of data engineering, right? There's somewhat query tuning and traditional DBA-type stuff for driving the performance of a database, but there's a much larger portion of it that's around: how is the data laid out on disk? What are your partitions? Are you doing sorting? I mean, all the stuff that you do in Hadoop to drive performance. And that led us to first identifying scenarios where we needed to do it, and then automating the actual data engineering that happens to enable this sort of more — I don't want to say real-time, because it's not real-time — but more interactive business intelligence type query flow. So what happened over time was we expanded from a single data warehouse and Hadoop to cloud data warehouses, because that was a trend that was happening, and then beyond cloud data warehouses to multiple data warehouses, both on-prem and in the cloud. Nobody ever moves all their data all at once; they're constantly looking at maintaining legacy and pushing on the new technologies that are available. So we created a technology where we could leverage the IP that we built around what we now call autonomous data engineering, and move that into the space of supporting multiple databases, multiple locations for the data, in a way that was transparent to the user. So from a data engineering, data management perspective, AtScale occupies what I think is a new and exciting space around leveraging virtualization technology to enable an end result, which is analytics. And I think we're the only company that's really doing what I would call autonomous data engineering for analytics. There are some traditional virtualization vendors that focus on federated queries and caching; those are absolutely necessary.
And they are strategies that need to be implemented, but they're one of maybe two or three dozen different things that we do to enable our core value proposition, which we've really simplified down: we're interested in performance, security and agility, and then in the cloud, cost savings is a very real thing — because I think the second month after you get to the cloud, your CFO usually gives you a call and tells you you're spending too much money. So — sorry, I'm talking a lot — but how it manifests itself is we built a virtual data warehouse, which means we look like we're a database, but we're not a database. And it can connect to any data anywhere and present it as one universal semantic layer — and we'll talk about what that means — and then you can query it from all your tools. So at the end of the day, you have this one data service, and you can fulfill all your data needs from it, and we worry about the intricacies of scaling that up from a concurrency and performance perspective. Now, I think "universal semantic layer" gets a lot of airtime, so just to be clear on what that means exactly: there are strings and numbers in databases, we all get that, and then in the real world there are business and real-world constructs like a customer, or a hierarchy, or a dimension of time. Those things have some analogies in the low-level type system in the database, but what we want to do is up-level that to present it in a way that makes more sense to people that consume data who aren't necessarily experts. So, for instance, I want to drill from country down to state, down to city, down to a specific zip code — that's a hierarchy. That's a thing that exists in the real world that you can express in a database, but it's not super easy to write that query. There's a lot of GROUP BYs, joins, all the other stuff that happens, and it gets even harder if you're doing it across multiple databases. So we present that semantic layer to the user so that they can do the analysis they want without having to worry about the complexities. At the end of the day, abstraction is a wonderful tool for solving problems, even for business people who don't necessarily understand the concepts around abstraction and why it's valuable: you encapsulate complexity and push it down.
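To make the hierarchy example concrete, here is a minimal sketch of the kind of SQL a semantic layer generates behind the scenes for one level of that drill path. The schema — a sales fact table joined to a geography dimension — is hypothetical, not AtScale's model; the point is the joins and GROUP BYs a user would otherwise hand-write at every level.

```sql
-- Hypothetical schema: drilling from country down to state means
-- regenerating the join and the GROUP BY at every level of the hierarchy.
SELECT g.country,
       g.state,
       SUM(s.amount) AS total_sales,
       COUNT(*)      AS order_count
FROM   sales s
JOIN   geo_dim g ON g.geo_id = s.geo_id
GROUP  BY g.country, g.state
ORDER  BY total_sales DESC;
```

Drilling down to city or zip code adds another column to the SELECT and GROUP BY, and if the fact and dimension tables live in different databases the join becomes a federated one — exactly the complexity the semantic layer hides.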
Tobias Macey
0:11:51
And so it sounds like there are a few different concerns that you're covering, and your overall capabilities are partially covered by some of the different open source platforms out there — such as some of the SQL query engines along the lines of Presto, or metadata management at some level for being able to identify what data you have, sort of the data catalog aspect of things. And all of those are useful tools in and of themselves, but as you said, there's a lot of extra engineering time that needs to be dedicated to just making sure that they're running, that they're able to perform at the scale that you need — particularly as your volumes of data grow, or as the data sets change — and making sure that the on-disk aspects of the data are optimized for those different query engines. So I'm curious if you can talk a bit more about some of the types of tools that somebody might be using in an existing data stack that they've built themselves, that they might be able to replace by moving to AtScale.
Matthew Baird
0:12:45
Absolutely. You're right, there's a ton of open source out there, and I've been involved in open source for over two decades as a member of the Apache group, as a PMC for db.apache.org, and contributed tons of code and tons of my personal time to open source — and I love it. We've contributed open source at AtScale as well. We made a decision that we weren't going to open source the full platform that we have, and that's more of a business decision; that's a different business model. But the technologies that I think people are using today — let's think about that. Clearly, we are not in the landing-data space; you set up those data pipelines independently. We come into play after the data lives in an enterprise data warehouse, multiple enterprise data warehouses, or a data lake as the newer construct for that. And then you have open source technologies that deal with getting that data to the end user, who's using Tableau, using Power BI, maybe using Excel, perhaps consuming it via, you know, JDBC or ODBC and building a platform around it. Sometimes, in fact, there's none of that and it's just very raw: let's put Impala here, or let's put Presto there. As you said, those technologies are wonderful, and we leverage those and we support those — we make them better. It's like a chocolate-and-peanut-butter thing, I think people would say. You have the ability to accelerate and scale Hadoop in a way that doesn't involve having to hire and retain very expensive and hard-to-find Hadoop folks. And it's easy to set up Presto, I get that; it's very easy to set up Cloudera Impala and have that work on Hadoop. It's not as easy to get it to work in a way that scales. For instance, one of our customers is a very large credit card company — to scale to multiple petabytes, with hundreds of thousands of people accessing it, requires some really smart folks, and they have really smart folks over at that company; they just have a lot of things to do. So if you can automate some of that, that's the part where I think we've done a really unique job of delivering a solution that hits the market where there isn't any open source to do this — around automating the data engineering tasks necessary to deliver a production view of data in a way that scales. But very specifically, that's the thing about this: Apache has something called Kylin, right? Yeah, Apache Kylin, Apache Druid. Those are two technologies that I think are — Apache Kylin is probably more on the OLAP side, Druid more of an aggregation engine. Those are interesting technologies. They're super rough around the edges, to be clear — I mean, that's not a technical assertion, that's more my opinion. They do not wrap an end-to-end solution, so you're constantly piecing together a bunch of things into the platform, and then you still have that issue of: you've got to do the data engineering yourself. And I don't think there's anything in the open source that does that today.
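As a rough illustration of the manual data engineering he is describing — partitioning, sorting, and file layout on Hadoop — here is a hedged, hypothetical Hive-style table definition. The names are invented, and AtScale's automated layouts are not claimed to look like this; it just shows the kind of physical-layout decision that otherwise has to be made by hand, table by table.

```sql
-- Hypothetical Hive DDL: on Hadoop, query performance comes largely from
-- physical layout choices like these rather than query tuning alone.
CREATE TABLE page_views (
  user_id  BIGINT,
  url      STRING,
  duration INT
)
PARTITIONED BY (view_date DATE)  -- lets the engine prune whole days of data
CLUSTERED BY (user_id)           -- bucketing for efficient joins on user_id
SORTED BY (user_id ASC)
INTO 64 BUCKETS
STORED AS ORC;                   -- columnar format suited to scan-heavy analytics
```

Re-deciding these choices as query patterns shift is the repetitive work an autonomous data engineering layer aims to take over.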
Tobias Macey
0:15:57
And can you talk a bit about the actual technical implementation of AtScale, and how it's architected, and some of the ways that it has evolved since you first began working on it — particularly given the shift in where people are spending a lot of their time and focus, in how they're treating their data and storing it, and the types of platforms that they want to be able to integrate with? Absolutely.
Matthew Baird
0:16:20
So we started the company six years ago, as of September. And when we sat down, we talked about what the stack would be that we would use. Clearly on the front end it's, you know, JavaScript, HTML, CSS — you don't even need a decision there, that's just the way it works. Whether you use Backbone or React — wait a couple weeks and the JavaScript stack will change. So we just picked the best thing that we could, and we architected the front end really well to adapt to that. In the middle tier, we decided that what we would build would be an API server, and we did not want to hire a whole bunch of engineers to work on the middle tier. We wanted a high-leverage language that we could use, and I had been working with Go for, you know, probably six months beforehand, and I was very encouraged by how you could use Go: you could go away, you could come back, and you could look at the code, and literally there was no friction in just stepping into that code and maintaining it. That's been, I'd say, an incredibly successful decision. And I know that there's never any language that universally everybody loves, but in terms of having a very small number of people maintaining a large code base, and being able to hire onto that code base and maintain it almost in a part-time way, it's been super successful. The performance has been fantastic. And honestly, in the enterprise, some people said, oh, if it's not on the JVM, nobody's ever going to let you run it. Not true — it's been fine, it's been adopted there. That's for the middle tier and the API services. And then on the back end, I worked with one of my co-founders, Sarah Gerweck, who is a huge fan of and a leader in the functional programming space. And while I think, if we had lifted all the restrictions, she might have gone with something much more exotic, the choice was: it's got to be on the JVM, it's got to be something that works well with what the big data tools are familiar with. At that point in time, the big thing was Spark was starting to take off, Hadoop was definitely a thing, and we really wanted it to be functional so that we could deal more naturally with data in a scalable way. So we chose Scala, which runs on the JVM, of course, and it's been a good experience. The language is highly leveraged. You don't have to have a huge team of experts — you've got to have some smart folks that understand the concepts. And I think, honestly, it's been good for attracting the sorts of engineers that we like, that want to do interesting, tough problems. I know that there's Play for Scala and everything else, but I think if you're using Scala to build web applications, you're missing the point. It's much more powerful as a back end language for doing heavy algorithmic work, which is what we're doing. So that's what we're doing from a language perspective. From an architecture perspective, the front end now is all in React; we got rid of a lot of the Backbone stuff — you're always, as I said, progressing that JavaScript stack. What we built on the front end is a Google Docs-like, multi-user concurrent application for designing data models and data services. For that we used WebSockets, and we used a WebSocket subprotocol called the Web Application Messaging Protocol (WAMP).
Doing this pub/sub type thing worked out really nicely. I'm happy with how the architecture scaled; it's low bandwidth. Fundamentally, what we're doing is multiple people concurrently working on a thing, but it isn't a check-in/check-out or merge model. You don't have merges and forks and everything right in the face of the users — they're not used to that. So while we do support things like versioning the artifacts that we're building, in Git, you aren't ever presented with a Git merge that you have to go through. It uses MVCC — multi-version concurrency control — to order the messages and then bring them together in that architecture. So that's the front end. The middle tier is pretty straightforward: RESTful interfaces, JWT for security, CORS so that the front end can talk to both the engine, which is running in Scala, and the middle tier, which is running in Go. And then on the back end, fundamentally, we built an optimizer. It's like a database optimizer, but it understands the nature of data that's distributed across multiple infrastructures, and every piece of technology in that chain, whether it's the network or anything else — in fact, it transcends the technology, and it has concepts around cost. Because when you are trying to optimize a query across multiple databases, it's no longer just that your optimization objective is pure speed; you want to create a scenario where you can optimize for other outcomes and then blend them together, so that whoever is running this data infrastructure can get what they need out of it. Whether that's pure speed — you can have that at the expense of cost, right? I can have elastic compute resources, I can scale things up. But you have to have that knowledge of: is this data co-located? Is it going over a low-latency network? Is it going to cost me an arm and a leg to serve this out of a BigQuery or one of the databases billed on a per-resource basis?
0:21:43
So, a little bit more about it: the Design Center builds the virtual schemas, and the engine runs them. It's very much a software-project-derived metaphor — you build, you deploy, you test, and you run it — for these data services that we're dynamically generating. We used to be on Hadoop, as I said — we used to run on a Hadoop node. Now we have no tie to Hadoop at all: we run on any image, any VM, any cloud, and we can talk to every cloud database out there. Is there anything else on the architecture side that might be interesting to your listeners? No, I think that's a good overview. And I'm also interested in your mentioning that AtScale is built to be able to interface with multiple different data warehouses, and you said that it also integrates with Hadoop. I'm curious what you have seen, particularly in recent years, as far as what the breakdown is of people leaning towards data warehouses or data lakes, particularly with the advent of cloud data warehouses that are starting to blur those lines. That's a really good insight. I think that we're starting to see — you know, I don't think the data lake definition is nearly as crisp and concise as the EDW definition is, and I do think that the cloud data warehouses have blurred that even more. Typically we thought of a data lake as, I guess traditionally, this is where the data lands, and it's unfiltered and it's raw and it's ready for doing whatever you're going to do to consume on it — but typically that involved doing some ETL, pushing it into a different database or a different technology.
0:23:26
And then the EDW is where the filtered, curated, ready-to-go data lives — this is blessed, let's go have at it. The nature of data has changed in the enterprise. I'd say the forward-thinking companies are more interested in getting that data in a perhaps less filtered form — they'd love to have it more filtered and more concise, but they think that the speed at which they can get the data to the users, and the agility with which those users can consume the data, is more important than necessarily providing this beautifully manicured data experience. And you see that in the spaces of data science, machine learning and things like that. But those cloud data warehouses have the capability, and have a cost model that's not prohibitive, for storing tons and tons of data, so they can serve both purposes. And that means that they can serve all the different constituents and all the different consumption patterns — as I said, whether it's business analytics, operational databases, KPIs, ad hoc, machine learning, or whatever other things are coming along that path. I think I said a bit — is there anything in there that's interesting that you'd like me to drill in on? I think that's good. And for somebody who's using AtScale, particularly if they already have an existing set of infrastructure, or even for people who are coming from greenfield, I'm wondering if you can talk through some of the workflow of getting AtScale set up and integrated with their systems, and then, for somebody who's using AtScale for doing analytics, what that looks like. So we have a combination — we serve the Global 2000, you know, people with lots of data and probably lots of databases. Some of them are newer companies — Wayfair, I'd say, is one of the newer companies that's come up in the last five years, and they have a very forward-thinking attitude towards data. I think if you go and read what the folks there are saying about data publicly, they're doing a lot of things really smart, really pragmatic. And while it is new school and cutting edge, it does represent where I think the market's going. And that is: let's give them one place to consume data, let's give them tools to find the data, and let's give them an infrastructure that makes it so that when they start to use that data, whether at a small scale or a large scale, we can grow with the consumption of that data and make it work. You know, they have thousands of people consuming data, and their attitude is everybody should have access to all data, except in the case where it needs to be governed or there's a compliance issue, in which case they apply that governance policy at that single data service entry point. So for them — to answer your question — my Chief Product Officer always says, when somebody asks me what time it is, I tell them how to build a watch. I feel like for engineering, you know, the implicit question is always: how does this work, and where are all the edge cases, and contextualize it. So for those sorts of forward-thinking organizations, where data is considered to be like air and should be available to everybody, it's an install — a single RPM install — and you model some data: you register your data warehouses, you model, you build a virtual data warehouse, you deploy it, and you just start querying. And that's it. It's very, very easy.
Now, you have to have access to the data warehouse, which at some companies — if you're a Bank of America or a JPMC, God bless you for having controls on that, because I'd be worried if you didn't — means you have to get access to those data warehouses, and in some cases that has to be secure, and Kerberos is challenging. It's probably always going to be challenging; it's the toughest system to work with, but it's what a lot of enterprises use, so you've got to go through that part of the scenario. And then getting better at using AtScale is about being able to iterate and put something in people's hands. If you're an IT group, or you're a data producer type person, you build something, you give it to people to consume, you gather the feedback, you iterate. And we're really good at that, because once you're in the role of serving people that consume data, you find out they're always changing their mind on what they want, and when they want it is now. So being able to model something, hit a button and deploy it, and have the service instantly flip over and represent those changes is hugely powerful. Because in the past, those consumers of data have been like: okay, I've got to fill out a JIRA ticket, I wait anywhere from a week to a month to hear back on the status of the ticket, then a software project happens, and then I get the new data element, because they had to build a data pipeline that does ETL, blah, blah, blah, all the way down the line. They don't want that, and I would argue the data engineering folks don't want to be in that space either. There's a lot of interesting data engineering tasks out there, and not a lot of them use the three-letter word ETL — there's a lot more interesting things to do. So we handle that for them from an analytics perspective. The workflow for implementing AtScale is very simple, and the iteration process and the ongoing maintenance are very easy.
Tobias Macey
0:28:29
And what are some of the main challenges that people are facing in their sort of traditional, quote-unquote, data engineering workflows — particularly if they've got a strong focus on ETL — that they can just hand off to AtScale once they implement it? And what are some of the ways that you have seen people use the time that gets freed up by virtue of not having to do those types of tasks anymore?
Matthew Baird
0:28:55
So think of this like — you remember when Salesforce first came out? I forget when, but their motto was "no software." There was software. And when we say there's no ETL, there's still ETL; it's a matter of what part is automated versus what part falls outside the purview of the analytics use case. You still have to be able to write data pipelines in whatever tool you end up using, whether it's open source or commercial, for landing the data. You still have to do the work of potentially doing some data wrangling or some data profiling, if you need to improve or augment that data in a significant way. But once you've got it to the point at which you think this is a good, valuable data set, and it potentially lives with a bunch of other data sets in a warehouse, that's where we can help: the users can self-serve on it. And we have customers actually doing this — I didn't expect that there would be a customer that would roll out our design tools to users so that they could design and build their own data warehouses and have them be productionized, all without involving IT at all. But if you think of it, it's almost like what MuleSoft does for APIs, but we do it for data. So we make it easy to do, we give them some automation capabilities, and then we roll it out. I think —
0:30:27
I went off, and I think I've lost track of what the core question was, Tobias. I'm really sorry.
Tobias Macey
0:30:31
I was just trying to get an idea of some of the ways that people are taking advantage of the extra time that gets freed up by virtue of not having to do as much ETL once they start using AtScale.
Matthew Baird
0:30:44
Ah, alright. Well, that depends on the organization, I think. I think that you would be best served to have your data engineering team working on building out what some people call the real-time enterprise, or streaming, and figuring out how to improve the latency and the quality of the data as it pertains to collection and all the ingest stuff. I think that has a much higher yield, because garbage in, garbage out — and faster data, you know, faster is better. The streaming use cases — nobody's really figured that out yet at an enterprise scale, so that's a great place to spend your data engineering time. You know, I read this statistic somewhere — I wish I could remember where so I could attribute it, but I went back and checked and it's true: there are 7,500 job openings for data engineers in San Francisco, and there are 7,400 people with the title of data engineer on LinkedIn for the US. So we have more demand for data engineering skills in San Francisco than we have supply in the US. I'm sure they're going to find something to use those data engineers for if they don't have to go and get this data element and make it available in Tableau for, you know, the marketing group — that should be automated, I think we can all agree. Things that should be automated, or can be automated, probably should be automated. That's an ethos that we really believe in. And it's not about taking away data engineering jobs; in fact, it's about making data engineers much more happy in the work that they do. Look at what they're doing if you join the data engineering teams at Facebook or Amazon or Google or any of the big cloud vendors — those are really interesting challenges. Making a data element available for an end user is super high value from a business perspective, but not a fun engineering challenge. And in terms of challenges that you have been faced with in the process of building and growing the AtScale platform, I'm curious what have been some of the most interesting or unexpected ones you've had to overcome, and some of the most useful lessons that you've learned in the process. Hiring. I think hiring — finding the right people to work on the types of problems that we work on, which are extremely algorithmic in nature. Every single thing needs to be scalable to multiple petabytes on the engine side of things, which is the Scala-based software that we develop. It's all about getting the right person in there and then getting them up to speed on, essentially, how databases are built. Even though we're not a database, we're kind of doing database-like development, but even harder, because we have to support all the databases out there.
0:33:50
Then there's the heuristics. There are unsolvable problems that we have to approximate with heuristics and identification of scenarios. So let's take an example of a kind of problem that we solve that turns out to be really hard: pre-aggregation. Doing a pre-processing step on multiple petabytes of data takes a really long time, and if you want to compute every possible combination that somebody could query, and every join path, it'll blow out your data massively, right? Think of it as the old school of how OLAP worked: pre-calculate all the common metrics and roll them out. My co-founder, who was at Yahoo, had a two-week build time; some of our customers had multi-day ones. And in today's enterprise, with today's data consumption patterns, at days you're already dead. You can't be looking at two days ago when you're running flash sales, or when you're spending a million dollars a day on some marketing program — you need to have up-to-date data. So that wasn't an option. But also, not doing anything means that as soon as people start to query the data, you have no information and you have no acceleration, so the queries take minutes, hours potentially. And there's a fine line where you're doing something that's smart, that's predictive, but you're doing it before you get all those traditional signals that you get from having the wealth of queries coming in. Queries are the natural expression of interest from the end user community, right? They'll tell you what they want if you go and gather requirements and talk to them face to face, but what they really want is represented by how they query the data. So we have to make this decision — and this was a major challenge, building things that I think have never been built before — in figuring out: when do you have enough data? And how do you deal with a small amount of data but still give a good user experience to people querying? That's been probably the second biggest challenge after hiring. I mean, of course, you need those people that are smart enough to do that type of engineering. Those two are the things that I think about from a technical perspective. From a business perspective, I think traditional virtualization has failed in the past. It's become a word that's associated with a specific implementation, a traditional implementation, and we've moved past that. And you're a specialist in data engineering — you know that, like, what, 15 years ago there wasn't really a — when did data engineering become a term that we use, and become a career? How long has that been?
Tobias Macey
0:36:32
About a decade. I think it's within the last five or six years that it's really become a discipline in and of itself. I mean, it's a set of responsibilities that have been around since we started dealing with computers, but in terms of an actual distinct role, it basically postdates the introduction of data science as the sort of new and interesting career path — because companies were spending all this money on data scientists and realizing that the majority of their time was actually spent doing all of this, you know, collection and preparation and cleaning work that wasn't actually providing the end value that they were expecting. And so that's when they started breaking it out into its own discipline.
Matthew Baird
0:37:11
I see you've done this for real. Yes — data engineering enables anything that has to do with data in your company, and those folks — like, it didn't exist. When you put a name on something and you create a career out of it, it has a way of sort of forcing the codification, the definition, of what all those things are. You know, like you said, we were doing some of those things; in some cases it was tribal knowledge, it was best practices, but it was split across multiple different types of people in the organization. Once we created a career path out of data engineering, things started to crisp up, and that's what enabled us to look at it and say: these are the things that are going to require humans, these are the things that probably a machine's going to end up doing — that's inevitable — and then go through that process of figuring out how to identify those scenarios and build them out. I'd probably put that in third place for the challenges of building the company. And that represents a business challenge as well. As I said, traditional virtualization predates data engineering, and even some companies that came along that maybe didn't start in the same way that we did, and didn't look at the problem the same way, went back to virtualization as being a federated querying and caching issue, and not an automation-of-data-engineering-tasks problem. So educating the market, and getting them to understand that — I know you've heard this term, but it doesn't mean what you think it means, and it doesn't carry the baggage that you think it carries — has been a big challenge. Not a big challenge, I mean, people see it, we show them, they get it; but from the perspective of language sometimes getting overloaded, that's challenging. So we've experimented with creating new terminologies for this, but at the end of the day, sometimes people actually like the comfort of hearing language that they've heard before. What are some of the most interesting or unexpected ways that you've seen people using AtScale, and some of the misconceptions that they might have going into it, that they are pleasantly surprised by when they discover some of the true capabilities of the platform that you've built? So as I said, we focus on performance, and performance is a term that means a lot of things: can a single query come back in a reasonable timeframe? Can you scale queries to whatever the size of the consumption pattern is? So handling concurrency, handling scale, handling per-query performance is what performance is about — and then security, agility and cost savings. And I think the places where — you know, security is sort of table stakes — the areas where people are surprised are how we drive performance in a way that's completely hands-off, and how easy it is to build and deploy these data services. I think I literally drive the sales people in my company crazy, because I'll go completely off-road and we can build a model and deploy a model in a meeting, and not do, like, a pre-canned demo. And then the cost savings — the cost savings are something that we actually probably didn't think about as much, because you typically don't lead with cost; it's a race to the bottom in traditional enterprise. But in cloud, it's been a big deal, and we see some very large reductions in cost.
And the surprising thing there, in the cloud, is that I don't think the cloud vendors mind that we say we're going to save you money on BigQuery, we're going to save you money on Amazon Redshift, we're going to save you money on Snowflake. They're looking at the longer game: if I can improve the unit economics of analysis and make this platform more cost effective from an ROI perspective, you don't constrain what you're doing and say, great, I came in under budget — you maximize the use of that technology, and you create more use cases, more opportunities for people to come and use the data, because you know that you have a scalable cost model around it. So folks like Google and folks like Amazon have been very open about it. And, you know, take a company like The Home Depot, which is a wonderful customer of ours: we did very specific things in our implementation for BigQuery that solve this. I mean, you can write a query that costs a lot of money, and you can express that query a different way and it costs significantly less — say a join, which increases the complexity of the query and the slot utilization in BigQuery, versus unnesting it or doing another strategy. We can do all that stuff in an automated fashion with query rewriting, and we get them to the point where initially they thought they'd have maybe a couple hundred people on the all-you-can-eat 2,000-slot program, but now they're able to get that to, like, five or six thousand people consuming, and getting the same exact answers at the same price — and they get that in an automated fashion. From a use case perspective, there are three that come to mind that were surprising to me, and maybe this is more because I'm a technologist and not necessarily a business person. The first is the customer that gave away the Design Center: they thought the Design Center was so usable and such a great experience that they gave it to their analysts and said, here's the data, here's the tool — design your own use cases, roll them out, query them, you are entirely in charge of a production-level self-service initiative. I'm very proud of the work that we did to make a high-quality user experience product, but I didn't think that it would be one where they'd give it to people that weren't necessarily data modelers. The Home Depot — we talked about them already, but they have a massive spreadsheet that's super important for the business and goes all the way to the top, to the CEO. And I know everybody disparages Excel — I don't, I think it's fantastic; it's a tool that's probably responsible for creating more programmers than any language in the world, because everybody's a little bit of a programmer when they get into Excel. And that tool is so important that all the stores have it. They were creating a use case that served suppliers and internal folks — a fantastic use case — and the scope of exposure of that data to people across the extended Home Depot family was amazing. And I really like the Visa use case. It's not surprising; it's actually more satisfying to me, because the reason that I started the company was to build an addressable, programmable platform that wasn't necessarily just going to be people on Tableau consuming it — it was about people building businesses and building applications on top of it. And that's what they did.
And they built it — I think there are multiple hundreds of thousands of their customers accessing trillions of rows of data on a giant — probably the best and biggest use of Hadoop in the world that I know of. It's a completely valid use of Hadoop. I know it's popular to hit Hadoop, but it's wonderful for them: they have to keep the data behind the firewall, and they have so much of it that, really, nothing else will work. So that was really interesting. My co-founder, as I said, is more of a pure BI guy, so, you know, having people connect Excel and Tableau and Power BI and query all day is fascinating. But I wanted to enable the developers that have had to just gyrate wildly to get software built that accesses large amounts of data. Now they can do it, I'd say, relatively straightforwardly, with the combination of virtualization and the automated data engineering that we do.
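Tying together the pre-aggregation and BigQuery cost threads from earlier in the conversation, here is a hedged sketch of the general technique — aggregate tables plus automatic query rewriting. The table names are hypothetical and this is not AtScale's actual implementation, just the shape of the idea in plain SQL.

```sql
-- Built once (and maintained incrementally) based on observed query patterns,
-- instead of pre-computing every possible combination up front:
CREATE TABLE agg_sales_by_day_region AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   raw_sales                  -- the multi-billion-row fact table
GROUP  BY order_date, region;

-- A BI tool issues this against the virtual warehouse:
--   SELECT region, SUM(amount)
--   FROM raw_sales
--   WHERE order_date >= DATE '2019-01-01'
--   GROUP BY region;
-- The engine can answer it from the small aggregate instead,
-- scanning far less data (and, on BigQuery, using fewer slots):
SELECT region,
       SUM(total_amount) AS total_amount
FROM   agg_sales_by_day_region
WHERE  order_date >= DATE '2019-01-01'
GROUP  BY region;
```

The hard part he describes is deciding which aggregates to build, and when, from an incomplete query history — the rewrite itself is the easier half.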
Tobias Macey
0:44:37
And by virtue of the fact that they're accessing the underlying data through this abstraction layer that handles some of the optimization aspects, it also helps in terms of future-proofing and reducing some of the risk of experimenting with new tools and platforms, because you don't have to re-implement any of the client-side code. You can just add in a new data store, test it out to see if it does what you want and if it has the sort of cost and performance characteristics that you're looking for, and then, if it doesn't work, you can just take it out again without having to do a whole bunch of retooling and reengineering of the rest of your stack.
Matthew Baird
0:45:15
You're a guy who programs — you know what it's like. The first thing you do, if you don't know what the implementation is going to be on the back end, is create an abstraction: in Java, it's an interface; Golang has, you know, more of a duck-typing-based contract system. You build that abstraction, you test it out to make sure it works, and then you're given the freedom to change the implementation without having all the client code have to change. And in this case, the client code is, quite frankly, humans — and humans are the hardest code to change. They get stuck in their ways; they figure out a way to do something and they want to do it that way forever. So giving that freedom to the IT and data engineering teams — I think you nailed it, that is life changing. You want to move a single table, you want to move all the tables, you want to go from on-prem to off-prem: we've effectively created an abstraction that gives you a software switch for controlling where the data is and how it's accessed, which gives freedom.
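A minimal sketch of that "software switch" idea, using nothing more exotic than a SQL view as the abstraction layer. All names here are hypothetical, and a single-database view only approximates what a virtualization layer does across engines, but the pattern is the same: clients bind to a stable name, and the implementation behind it can move.

```sql
-- Clients (dashboards, notebooks, apps) only ever query `analytics.orders`.
CREATE VIEW analytics.orders AS
SELECT order_id, customer_id, amount, ordered_at
FROM   onprem_dw.orders;           -- today: the on-prem warehouse

-- After migrating the table to a cloud warehouse, flip the switch;
-- every downstream query keeps working unchanged.
CREATE OR REPLACE VIEW analytics.orders AS
SELECT order_id, customer_id, amount, ordered_at
FROM   cloud_dw.orders;
```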
Tobias Macey
0:46:16
And when is the AtScale platform the wrong choice?
Matthew Baird
0:46:19
Well, we're not an OLTP platform — we don't handle what I would call every bit of the SQL spec. We are a multi-dimensional analytics platform, so for analytics use cases, it's good. If you want to talk about creating a data service that does sort of both sides, right now we're not that; the product is more focused. In terms of whether you should be a customer or not: you may have a large amount of data in a single store, you may have big data in aggregate — and what we find to be much more common is that every company's got big data, it's just spread across 50 to 100 to 1,000 databases. But if you have a single data warehouse and you have small data, you know, probably just throw Tableau at it, or write a web app on top of it — that's pretty straightforward. You don't need to buy AtScale at that point.
Tobias Macey
0:47:15
And what do you have planned for future iterations of the AtScale platform and business, either in terms of feature improvements or new product areas?
Matthew Baird
0:47:26
We're going to keep it simple and focus on performance, security, agility, and cost, and within that space build a platform that solves for what I refer to as a governed self-service environment. That means users need to be able to discover the data, so that's cataloguing and metadata management. They need to be able to apply policy in a centralized place so they can decentralize access. That concept is really powerful when you think about it, because it presupposes that you have a natural path to the data that goes through one place where policy can be enforced. We are focused now on the Global 2000, and we've implemented a lot of things there. You asked a little while ago about open source and replacing open source. One thing about open source, and I love it, is that the maintainers don't know all the situations around security that happen in an enterprise. So we've built out a lot of features and functionality to have the best security story of any company out there. We own big data for the Global 2000. Now we've got to go into the mid-market and focus on building products and services that are going to help those companies, because they have big data problems and data-everywhere problems just like the big guys do.
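One way to picture the "centralized policy, decentralized access" idea is a single enforcement point that every query passes through. The following Go sketch is a loose illustration under that assumption; the conversation doesn't describe AtScale's actual enforcement mechanism, and all names here are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// Policy maps a user to the tables they may read. Keeping this map in
// one place is what makes it safe to hand out access broadly.
type Policy map[string]map[string]bool

// Gateway is the single choke point all queries flow through.
type Gateway struct{ policy Policy }

// Execute runs a query only if the central policy allows it.
func (g Gateway) Execute(user, table string) error {
	if !g.policy[user][table] {
		return errors.New("policy denied: " + user + " -> " + table)
	}
	fmt.Printf("running query for %s on %s\n", user, table)
	return nil
}

func main() {
	gw := Gateway{policy: Policy{
		"analyst": {"sales": true},
	}}
	fmt.Println(gw.Execute("analyst", "sales"))   // allowed
	fmt.Println(gw.Execute("analyst", "payroll")) // denied centrally
}
```

Because every consumer goes through the one gateway, tightening or loosening a rule happens in a single place rather than in every downstream tool.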
Tobias Macey
0:48:48
Are there any other aspects of the AtScale platform and the work that you're doing there, or the ideas around data virtualization or data engineering automation, that we didn't discuss yet that you'd like to cover before we close out the show?
Matthew Baird
0:49:00
You know, I think the industry's changing. There's the whole divergence-and-convergence model of solving problems, and as I apply that to industries, we are still diverging. Data engineering is growing; it's still being defined. We haven't reached the apex and started to bring it back together into a concise "these are the technologies, these are the activities, these are the kinds of people involved in it." So it's exciting, but it's also a big challenge. We're going to have to keep up to date with what the best practices are and translate those into what our product does. Should we be driving tools like Beam, or any of those data movement tools? Probably, absolutely.

Is streaming going to become a bigger issue? I 100% believe streaming is going to be a challenge for enabling the kind of consumption of data that enterprises want over the next decade. And frankly, the whole toolchain is not ready for it. Even if you look at the traditional BI tools, they don't have any way to really work with streaming data. There are point solutions here and there, and some people have started startups to do it, but for the majority of consumption, that pipeline from ingest through to the business analyst doesn't exist. So we're going to have to see those areas change, and we're going to have to keep up to date with that. So that's streaming.

I'd be a bad CTO if I didn't say the ML phrase, but I do think this is actually a space where machine learning and advanced statistics are going to be a major improvement. Think about what virtualization gives us. It's not just virtualization across multiple data warehouses, by the way; you virtualize the column itself, so it's not necessarily a nominal value, it's a computed value. You have databases that now support pushing machine learning down to the data, and there are ways to do it, but we have to expose that to end users in a way where they don't have to be mathematicians yet still get the benefit of that sort of experience with data, where it's more helpful. And I hope it's not going to turn out like, and I'm going to date myself now, little Clippy. Remember Clippy? Oh, I remember Clippy; I think everybody who has ever experienced it remembers it. When you think about it, Microsoft was way ahead of the game on building a digital assistant. The problem was that everybody hated him. He was a nice guy; he tried. But I think those sorts of things are becoming more pervasive in a way that has to reach the people doing analytics at companies, because otherwise the amount of analysis they can do will be limited. It's creeping in with things like NLP, but there are other places where it would be super easy to just do a linear regression and show you, when you're looking at a bucket of data, that there's an outlier in there. I don't think anybody's really doing stuff even that easy right now, but we'll get there.

I learned this myself. I had a Pixel, or not a Pixel, I had an Android ever since cell phones came out, and I transferred over to Apple.
And while the Apple device is beautiful and I love it, I never realized how much the machine learning from Google in that device was helping me with my day to day. I'm a technologist, and I was obviously aware of it because I was using it every day, but it didn't strike me as a game changer because it had been folded into my workflows and I'd become more and more used to it: the predictive stuff, not just in text and speech recognition but in the applications, plugging it into my car and having it anticipate where I'm going to go, those sorts of things. It's a very natural flow that has to happen, and we have to be there as a company, as AtScale, to enable that flow of information and create intelligence from the data.
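Matthew's linear-regression example above is simple enough to sketch end to end. This hypothetical Go snippet fits an ordinary least-squares line to a series and flags any point whose residual exceeds two standard deviations, the kind of lightweight automatic check he suggests a tool could surface; the data and threshold are made up for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// fitLine returns the slope and intercept of the least-squares line through (x, y).
func fitLine(x, y []float64) (slope, intercept float64) {
	n := float64(len(x))
	var sumX, sumY, sumXY, sumXX float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
		sumXY += x[i] * y[i]
		sumXX += x[i] * x[i]
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n
	return
}

func main() {
	// Hypothetical daily metric: roughly linear growth with one anomalous day.
	x := []float64{1, 2, 3, 4, 5, 6, 7}
	y := []float64{10, 12, 14, 16, 40, 20, 22}

	slope, intercept := fitLine(x, y)

	// Compute residuals from the trend line and their standard deviation.
	residuals := make([]float64, len(x))
	var sumSq float64
	for i := range x {
		residuals[i] = y[i] - (slope*x[i] + intercept)
		sumSq += residuals[i] * residuals[i]
	}
	sd := math.Sqrt(sumSq / float64(len(x)))

	// Flag anything more than two standard deviations off the trend.
	for i, r := range residuals {
		if math.Abs(r) > 2*sd {
			fmt.Printf("outlier at x=%.0f: y=%.0f (residual %.1f)\n", x[i], y[i], r)
		}
	}
}
```

On this sample the day-five spike is the only point flagged, which is exactly the "show you there's an outlier in there" experience he describes, with no mathematician required on the user's end.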
Tobias Macey
0:53:17
All right. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Matthew Baird
0:53:32
I'm super biased, Tobias. But
0:53:37
I do believe that the virtualization model can work, and that it will be the way people build these single data services and the way people get broad adoption of a secure, governed, self-service analytics solution for all use cases. That's the gap. I think we're in the lead; I think we have a new approach. We're not there yet, but we're the furthest along, and I think we have the best vision for doing it. And it has to be automated, because the other side of the gap is hiring: data engineer is going to continue to be the hottest, and probably highest paid, software career for a long time. I just don't see an end to that.
Tobias Macey
0:54:32
All right. Well, thank you very much for taking the time today to join me and discuss your work on the AtScale platform. It's definitely an interesting piece of technology, and one that addresses a necessary evil in the data management space. So thank you for all of your work on that front, and I hope you enjoy the rest of your day.
Matthew Baird
0:54:48
Thank you, Tobias. I enjoyed this.
Tobias Macey
0:54:55
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.