Summary
The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and it can work across clouds and all the way out to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing have for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Mark Cusack about Yellowbrick, a data warehouse designed for distributed clouds
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Yellowbrick is and some of the story behind it?
- What does the term "distributed cloud" signify and what challenges are associated with it?
- How would you characterize Yellowbrick’s position in the database/DWH market?
- How is Yellowbrick architected?
- How have the goals and design of the platform changed or evolved over time?
- How does Yellowbrick maintain visibility across the different data locations that it is responsible for?
- What capabilities does it offer for being able to join across the disparate "clouds"?
- What are some data modeling strategies that users should consider when designing their deployment of Yellowbrick?
- What are some of the capabilities of Yellowbrick that you find most useful or technically interesting?
- For someone who is adopting Yellowbrick, what is the process for getting it integrated into their data systems?
- What are the most underutilized, overlooked, or misunderstood features of Yellowbrick?
- What are the most interesting, innovative, or unexpected ways that you have seen Yellowbrick used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Yellowbrick?
- When is Yellowbrick the wrong choice?
- What do you have planned for the future of the product?
Contact Info
- @markcusack on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Yellowbrick
- Teradata
- Rainstor
- Distributed Cloud
- Hybrid Cloud
- SwimOS
- Kafka
- Pulsar
- Snowflake
- AWS Redshift
- MPP == Massively Parallel Processing
- Presto
- Trino
- L3 Cache
- NVMe
- Reactive Programming
- Coroutine
- Star Schema
- Denodo
- LexisNexis
- Vertica
- Netezza
- Greenplum
- PostgreSQL
- Clickhouse
- Erasure Coding
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan, that's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Mark Cusack about Yellowbrick, a data warehouse designed
[00:02:04] Unknown:
for distributed clouds. So, Mark, can you start by introducing yourself? Hi there, Tobias. Very nice to be here with you today. Yeah. My name is Mark Cusack. I'm the chief technology officer at Yellowbrick. I've been with Yellowbrick for around about the last 7 months. And before joining as CTO, I was an exec at Teradata, where I led the product management team responsible for the data warehouse software and the advanced analytics and machine learning portfolio there. I was at Teradata for around 6 years, and at one point I pretty much owned all of the product portfolio at Teradata. But I came to Teradata by way of an acquisition they made in 2014 of a startup called Rainstor, which was in the data warehouse archiving market. We built that up from scratch over about 10 years. So I've been in amongst databases and data warehousing for a big chunk of my career. And do you remember how you first got involved in the area of data management?
I remember back to the dim distant past. This actually came out of some work at the UK Ministry of Defence, where I was a research scientist. We got grants to research various aspects of how technology could help military training and development. And one of the interesting use cases they had was around how you manage the data collected from large-scale field exercises. You can imagine fleets of tanks roaming across the plains; they would stop these training exercises, and they'd have a whole bunch of data gathered from all of these military systems, and they wanted to do what-if analysis in the field. And as you can imagine, back in the early 2000s, the technology that we had for storing, processing, and running analytics was pretty primitive, and the hardware footprint available in the field was small. And so we had the bright idea of coming up with a compressed in-memory database technology that we could actually deploy in the field.
And so it turned out that this technology was pretty useful in other applications, and we ended up spinning it out to form Rainstor, the startup company that grew into this data warehouse and archiving technology. So it was a spin-out of some military IP
[00:04:09] Unknown:
into the commercial world, and a pretty successful one. As you mentioned, you recently joined the team at Yellowbrick. So I'm wondering if you can start by giving a bit of an overview of what Yellowbrick is, some of the story behind it, and your motivation for joining the team there.
[00:04:23] Unknown:
So Yellowbrick is a modern data warehouse for the hybrid and distributed clouds, and we can break down what we really mean by that into its constituent parts. By modern data warehouse, we mean we've built a data warehouse, a database technology, from the ground up using modern software development techniques and with a thorough understanding of the hardware platform that we're working on as well. And, hopefully, we'll get an opportunity to talk in more detail about that later. But the Yellowbrick story is really interesting: Neil Carson, the CEO, Jim Dawson, and Mark Brinicombe were the founders, and they were at Fusion-io, a flash storage company.
Back in around the 2010 to 2013 period, SSDs were beginning to become really interesting from a data warehousing perspective, and traditional companies like Teradata or Oracle were thinking, well, how can we replace our spinning disks with SSDs? Will we get around some of the performance and bandwidth bottlenecks of those technologies, which are traditionally where the bottlenecks in data warehousing analytics reside? And so they went to companies like Fusion-io and said, well, can we take those SSDs, shove them in our data warehouse platforms, and get a benefit from that? When these companies did that, they found they didn't get a benefit. What they'd done was move the bottleneck elsewhere, up into DRAM, up into main memory.
And so Neil and the team had an idea: what if we could take some of the ideas of how we route data from SSDs directly into the CPUs and bypass main memory? Maybe we could come up with a database technology that would cut out that middleman, cut out that bottleneck, and that's exactly what they did. Yellowbrick was founded in 2014 and came out of stealth mode in 2017. We've acquired quite an array of enterprise customers now, we've raised $173 million in venture capital, and we're having a fantastic year.
[00:06:19] Unknown:
In terms of the actual distributed cloud terminology, you mentioned hybrid cloud and distributed cloud in somewhat the same breath. I'm wondering if you can talk a bit about the challenges that are associated with working in a distributed cloud environment, what that really means in terms of the actual deployment topology, and some of the types of environments where that kind of requirement will come up. Yeah. Well, I think Gartner actually coined the phrase distributed cloud, and they also have a rather pithy way of saying that distributed cloud fixes what hybrid cloud breaks.
[00:06:53] Unknown:
And just to define what a distributed cloud is: it's the idea of taking the public cloud hardware and software stacks in AWS and Azure and Google Cloud, for example, but locating them elsewhere. So in private clouds, in customers' own data centers, or even at the network edge. And so now you have a common, unified set of hardware and software and APIs on which to deploy higher-level services. Right? Another defining feature of distributed cloud is the idea that you have a single control plane across all of these different locations where you're deploying cloud services, to control all these things. So now you've got a homogeneous foundation, a single pane of glass for control, and you're able to deploy data and analytics solutions at the point of need, which is increasingly where the data is being generated.
And that could be at the network edge, it could be public data centers, etcetera. So that's really what distributed cloud is. And rather than challenges, it actually opens up a whole bunch of opportunities in this area. When you consider Gartner saying that within a couple of years, 50% of all data is gonna be generated outside of a corporate data center, and by 2025 it's gonna be 90%, the idea of pulling tons of data back from the edge to process within a public cloud isn't gonna work. We're not gonna be able to backhaul that amount of data. We're gonna need to push the processing out towards the edge, process it there, filter it, combine it, aggregate it there, and only send back what we need. So this is where interesting topologies start to arise. You can imagine databases being deployed close to the field data centers associated with 5G telecom applications, where you might have 100 antennas connected to a single field data center, with an analytics stack deployed there on something like Kubernetes to do localized filtering and low-latency connections with connected cars, while the center collects that history and does classical historical processing.
[00:08:57] Unknown:
And as you're talking about pushing the actual database engine into these edge locations, it makes me think about some of the current architectural, quote, unquote, best practices of using something like Kafka or Pulsar to stream those messages into a centralized location, so that you have this piece of, quote, unquote, big iron, like in the rack and server days, but as these cloud data warehouses where all your processing happens in one location. Or having something like SwimOS, where you do some edge processing on streaming data and then maybe send some of the filtered data back to the central location for the big processing. I'm wondering what architectural patterns come about with something like Yellowbrick, where you have the native capability of running across these different platforms, versus what has come to be known as the best way to do things given the set of technologies that people have settled on? Just to address the first part of that, I think one of the things that databases haven't got right up until now, and this is, I think, a key thing that Yellowbrick has solved, is the idea of how to properly handle
[00:10:13] Unknown:
streaming data and near-real-time analytics and low-latency access to that data. Because, typically, data warehouses are all about batch processing. And even if you look at the new breed of cloud data warehouses, the so-called next-generation data warehouses like Snowflake or even Redshift, they don't handle streaming data. We made some particular architectural and implementation decisions to make us equally good at streaming single-record inserts into Yellowbrick as we are at doing batch high-volume loads.
But if we take a step back and think about how Yellowbrick is architected, we're built from the ground up to be an MPP scale-out data warehouse. So we can scale down to the very smallest use case and up to multi-petabyte use cases, and we've got customers operating at those kinds of scales today. Our key value proposition is our industry-leading price and performance, and so we've optimized the software and the hardware instances we run on to really maximize the performance and minimize the price at every point. And that's what sets us apart, along with being able to handle streaming data in a way that the other data warehouses can't, and being able to deploy across a range of different on-prem or cloud options as well. Talking a bit more about some of the use cases for this distributed cloud architecture,
[00:11:37] Unknown:
maybe we can follow on a bit more about some of the existing patterns that people have started to use that they could replace with Yellowbrick, or some of the ways that people have traditionally thought about other types of database architectures where Yellowbrick might be feasible given the way that it's designed?
[00:11:55] Unknown:
Yeah. And to be very clear, Yellowbrick is a relational SQL data warehouse, fully focused on solving the data warehousing problems that both traditional vendors and the new breed of data warehouse vendors are going after. We typically find ourselves being deployed in 3 technical use cases. First of all, we're going in and replacing legacy data warehouses like Oracle and Teradata and SQL Server that are proving too expensive, too difficult to administer, and too difficult to expand. So we're going in there and doing a straight replacement, or new workloads come to Yellowbrick. That's one use case, and it works because, typically, we look and smell like Postgres from the outside. But underneath the hood, we are very, very different. What that does mean is that the ecosystem of BI tools and data integration tools that a lot of enterprises have typically just integrates very seamlessly when they're thinking about moving to a modern data warehouse.
The second set of technical use cases we typically go into is around data lake augmentation. This is the situation where people are using data lakes like Databricks, or Hadoop and Parquet combinations, or Presto, and they're trying to use them as data warehouses. And, funnily enough, they're not getting the performance or the price that they need for their particular use cases out of that. Data lakes have been great in terms of the flexibility and the different kinds of analytics you can bring. But if you wanna do highly structured relational processing against that data, then the best place for it is in a warehouse designed for that shape of data. And last but not least, we're getting a lot of traction with the wave of customers that have jumped into the cloud, adopted cloud data warehousing solutions, and then realized the consumption model that they employ is too unpredictable in terms of spend, so they can't predict what their budgets are gonna be for the next year. We have a very, very simple flat-rate subscription with Yellowbrick, and so your spend is very predictable.
It's lower, and a lot of enterprises are really valuing that. So these are kind of the 3 areas that typically people bring us in for.
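To make that Postgres compatibility concrete, here is a minimal sketch of connecting to a Yellowbrick endpoint with a stock Postgres driver. The host, database, credentials, and table are hypothetical placeholders; the point is simply that nothing Yellowbrick-specific is required on the client side.

```python
# Minimal sketch: Yellowbrick speaks the Postgres wire protocol, so a stock
# driver such as psycopg2 connects with no Yellowbrick-specific client code.
# Host, database, user, and table names below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="yb.example.internal",  # your Yellowbrick endpoint
    port=5432,
    dbname="analytics",
    user="report_user",
    password="change-me",
)

with conn.cursor() as cur:
    # Ordinary ANSI SQL, exactly as you would issue against Postgres.
    cur.execute(
        """
        SELECT region, SUM(amount) AS total_sales
        FROM sales
        GROUP BY region
        ORDER BY total_sales DESC
        """
    )
    for region, total_sales in cur.fetchall():
        print(region, total_sales)

conn.close()
```

This same property is what lets Tableau, Power BI, Informatica, and the rest of the Postgres-speaking ecosystem work unmodified, as Mark describes later in the conversation.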
[00:14:04] Unknown:
And in terms of the kind of main focus of the product, the overall use cases that it's aimed at, and the way that it has been designed as a result, I'm curious how those goals and designs have shifted since the product was first conceived and as it starts to hit the realities of the market and the realities of the use cases?
[00:14:24] Unknown:
Good question. The strategy has evolved, but it's worth quickly illustrating what the strategy is and the approach we've taken architecturally from the start, because it remains absolutely core to what we do going forward. As I mentioned, back in the day when data warehousing companies were trying out SSDs to see if they could speed up their processing and realized they couldn't, the way that Yellowbrick was designed took very, very full advantage of technologies like NVMe SSDs and the ability to route data directly from SSDs into the L3 cache on CPUs.
So that cuts out random access memory; it cuts out the main memory. Typically, databases use main memory for a buffer cache, the idea being that it's obviously faster to access data from memory into the CPU than from disk directly to the CPU. At least, that's the traditional wisdom. But with NVMe, that simply is not the case, and you can get the same bandwidth out of SSDs via NVMe as you can get out of main memory. So what we do, essentially, when we run a query, is directly address the data on the SSDs from the CPU. That data gets loaded into the L3 cache, and then our entire software layer takes over. Now, it's not just the inefficiencies of main memory that we bypass; we've also looked up the software stack as well. And if you look at pretty much every software vendor out there in the warehousing space, they're all building on a stock Linux kernel.
So we start off with Linux, but we actually run in what's called a user-space bypass kernel. In other words, it's a Linux process that basically, on startup, says: Linux, get out of the way, Yellowbrick's taking over. We're gonna manage the memory, the threading and scheduling, the storage stack, the network stack, and which device drivers we use. And Linux is relegated to just certain monitoring and logging capabilities. Because if you do a deep analysis, the network stack in Linux can be made 20 times more efficient than the standard networking stack that's in there today. The storage and IO, if you have more creative alternatives, can be 100 times faster and more efficient.
And then you build on top of that new programming paradigms around reactive programming, coroutines, and things like this. What it means is that all the way up the software stack, in the execution of a query, we are in charge of making sure that memory doesn't get fragmented. We minimize context switching, and we keep those L3 caches on the money as much as possible and avoid cache misses at every possible point. We've done a huge amount of work to make sure that when you run a query, everything you need to satisfy that query is, as much as possible, kept in that local L3 cache for the best performance, and that's what gives us 5, 10, 100 times improvements in performance compared to other data warehouses. That theme has stayed with us all the way through. But what has changed is that, up until now, we've been providing our own specialized hardware to go with the software database.
But now we're seeing the cloud vendors and their hardware instances catch up. Today you get AWS or Azure instances with NVMe SSDs in them, and we can take advantage of that out of the box. And so all of the performance improvements that we've developed over the years now apply in the cloud domain too.
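Yellowbrick's kernel-bypass internals obviously can't be reproduced in a few lines, but the cache-locality principle behind them is easy to demonstrate. This toy Python/NumPy timing (an illustration, not Yellowbrick code) sums the same matrix twice; the traversal that touches memory contiguously keeps the CPU caches hot and runs markedly faster than the strided one, even though both do identical arithmetic.

```python
# Toy illustration (not Yellowbrick code) of why cache locality matters:
# identical arithmetic, very different memory access patterns.
import time
import numpy as np

n = 4000
m = np.ones((n, n), dtype=np.float64)  # C-contiguous: rows are adjacent in memory

start = time.perf_counter()
row_total = sum(m[i, :].sum() for i in range(n))  # contiguous, prefetch-friendly
t_rows = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(m[:, j].sum() for j in range(n))  # strided, cache-hostile
t_cols = time.perf_counter() - start

assert row_total == col_total  # same answer, very different cost
print(f"row-wise: {t_rows:.3f}s  column-wise: {t_cols:.3f}s")
```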
[00:17:49] Unknown:
And then in terms of the actual software stack, as you're mentioning, you're able to gain better performance because you're optimizing for a specific workload, whereas some of the anecdotes about what's still in the Linux kernel, like being able to run a floppy disk, suggest it's just wasting space and CPU cycles. I'm wondering, particularly as you're talking about the distributed cloud, what are some of the optimizations that you've built in, in terms of the communication capabilities, but also in terms of the management interface, for being able to gain visibility of the data assets that you have across those different cloud environments that you're working with?
[00:18:28] Unknown:
We just announced the release of a new component of our portfolio called Yellowbrick Manager. And one of the ideas of distributed cloud, as I mentioned earlier, was that single pane of glass, a single control point on which you manage your data and analytics services. That's the ultimate vision of distributed cloud, which, of course, goes beyond just data warehousing; we like to position ourselves as a data warehouse for such an environment. Yellowbrick Manager is our single unified control plane. And so within Yellowbrick Manager, the vision is that you will be able to provision instances of the Yellowbrick database at the edge, in public cloud, and within private cloud data centers, and administer and access them all from within this single pane of glass, essentially.
That's kind of our first foray into distributed cloud. But one thing that really enables us to do this is our adoption, going forward, of containerization and Kubernetes as our run-anywhere deployment platform of choice. Because once we have containerized software, all of this OS kernel work that I mentioned and all the stack of software above it, we can deploy on a 5-year-old laptop all the way up to, you know, the most specialized hardware instances
[00:19:42] Unknown:
available in the public cloud. As far as the actual use and management of the database, are there any considerations that go into data modeling that are specific to the way that you might use Yellowbrick, versus the traditional star or snowflake schema or the current trend towards wide tables, particularly as you're working across these different cloud environments, for being able to optimize data locality or to potentially join across datasets in these different locations?
[00:20:13] Unknown:
Yellowbrick is really aimed at those orthodox data warehousing use cases, first of all. We are fully relational, with ACID semantics in place, so inserts, updates, deletes, and transactions are all fully supported within Yellowbrick. The notion of primary key and foreign key relationships is in there as well. And so the idea is that if you've got a data warehousing schema in another data warehouse, it's a very simple exercise to replicate it within Yellowbrick. You don't have to do any crazy denormalization to get performance out of it. We can do joins at massive scale at very, very high performance.
That all works out of the box. We have examples where customers have consolidated their data out of a bunch of, for example, SQL Server instances, lifted and shifted it into Yellowbrick, and seen 150x performance improvements out of the box. There are Teradata use cases where, out of 250 queries migrated, we beat Teradata on 248 of them out of the box, and they all just ran. So, you know, lots of examples. What we try to do is make it as easy as possible to translate what you're doing today into what you can do tomorrow with Yellowbrick. And as I mentioned earlier, with the PostgreSQL dialect that we support, together with extensions for certain aspects of Oracle and other dialects,
it becomes even easier. You don't need to change any of your tools: your Tableau and your Power BI still work, and so do your Informatica and DataStage integration tools. Actually, I didn't answer the second part of your question, which just came back to me, around federated queries. First of all, the answer is that we do this today through partners. We have a good partner called Denodo, which you may have heard of, which provides the ability to do query federation across a range of sources, one of those sources being Yellowbrick. So we're not working on the idea of query virtualization or data fabrics ourselves.
But what is really interesting, actually, is when you think about distributed clouds and where the data is gonna be located, whether it's because it's being generated there, or because for data sovereignty or gravity or privacy reasons it's gotta remain in a particular data center or country, you need to minimize the movement of data, and you wanna be pushing the analytics to the data rather than pulling the data back. It's just too inefficient; too much time is spent moving data around before you get any value from it. So distributed cloud, in my view, is all about squeezing out and achieving value from where the data lies. And I think when the volumes start to get really large, the idea of federating queries across locations will start to get expensive as well. That's an element that's gonna be interesting to see play out. Yeah. It definitely pays to think about the questions that you're trying to answer, and to get a fragment of the answer from each of the different locations and then join those fragments, rather than trying to join everything to build the entire answer at once. Well, because you start to get hit by data egress fees from public clouds. So you gotta push down a lot of processing, and you gotta do compression over wide area networks, as well as encryption. There are a lot of technical things you need to think about to make it cost effective, essentially.
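To make the "join the fragments" idea concrete, here is a minimal sketch of aggregation pushdown under assumed endpoints and schema: each location computes a partial COUNT and SUM close to its data, and only those two numbers cross the WAN per site. Note that a global mean has to be rebuilt from sums and counts; averaging per-site averages would be wrong when sites hold different volumes.

```python
# Minimal sketch of pushing analytics to the data: per-site partial aggregates
# travel over the WAN instead of raw rows. Endpoints and schema are hypothetical.
import psycopg2

EDGE_ENDPOINTS = ["edge-us.example.com", "edge-eu.example.com", "edge-apac.example.com"]

PARTIAL_SQL = """
    SELECT COUNT(*) AS n, COALESCE(SUM(latency_ms), 0) AS total_latency
    FROM requests
    WHERE ts >= now() - interval '1 hour'
"""

grand_n, grand_total = 0, 0.0
for host in EDGE_ENDPOINTS:
    with psycopg2.connect(host=host, dbname="telemetry", user="reader") as conn:
        with conn.cursor() as cur:
            cur.execute(PARTIAL_SQL)
            n, total = cur.fetchone()
            grand_n += n                  # only two numbers cross the WAN per site
            grand_total += float(total)

print(f"global mean latency: {grand_total / grand_n:.2f} ms over {grand_n} requests")
```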
[00:23:27] Unknown:
And in terms of the query planner in Yellowbrick, is that something that you've started to think about, as far as being able to offer some intelligence for the case of: I'm trying to answer this question on this dataset, but I have copies of this data across multiple locations; being able to push that down, get the intermediate responses, and then join them back at the point of query?
[00:23:47] Unknown:
No. I mean, we're not looking at that. I'd say that's really in the realm of things like Denodo. We have a very, very sophisticated cost-based optimizer within Yellowbrick, which can turn the most horrendous SQL queries into compiled plans that are rather quite beautiful. So we're focusing on making Yellowbrick and its parser, planner, compiler, and all those pieces the most efficient and fastest anywhere.
[00:24:12] Unknown:
Going back to your point earlier about the use of streaming data and optimizing for single-row inserts, as opposed to forcing people to batch up their inserts and updates, I'm wondering if you can talk through a bit of how the actual on-disk representation is designed to allow for those single-row inserts without taking a performance penalty, or without having to,
[00:24:36] Unknown:
you know, do large data reshuffles as you get to a certain tipping point of those individual inserts or updates? The way we've tackled that, again, comes from having the opportunity to start 6 or 7 years ago from a blank piece of paper, knowing what near-real-time streaming use cases were emerging. What the team put together is Yellowbrick as a hybrid row and columnar store data warehouse. You can think of data being stored in 2 ways: streaming data comes in row by row, record by record, into our row store component, and over time it's aged out of there and compressed into our columnar, MPP scale-out back end. That's how we handle it. Obviously, columnar databases are traditionally terrible at doing inserts.
So we do individual record inserts via the row store, and then those automatically get moved out, as I mentioned. But as far as a query is concerned, we unify that entire picture, and so we're transactionally consistent between our row store and our column store. When we reach a certain volume within the row store, records just automatically get moved out. That's how we reconcile the need for high-performance bulk loading into the columnar store at the back end while at the same time supporting out-of-the-box subscription to a Kafka topic, streaming record by record into Yellowbrick.
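Yellowbrick's Kafka subscription is built in, so the sketch below is only an outside approximation of the pattern Mark describes: a generic consumer doing single-record inserts through the Postgres-compatible front end, with the row store absorbing the per-record writes. Topic, table, and connection details are all hypothetical.

```python
# Rough sketch of record-by-record streaming ingest. Yellowbrick has its own
# out-of-the-box Kafka integration; this generic consumer only approximates
# the pattern via the Postgres front end. All names here are hypothetical.
import json

import psycopg2
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "payments",                                      # hypothetical topic
    bootstrap_servers="kafka.example.internal:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

conn = psycopg2.connect(host="yb.example.internal", dbname="fraud", user="ingest")
conn.autocommit = True  # one transaction per record keeps latency low

with conn.cursor() as cur:
    for msg in consumer:  # blocks, consuming the stream indefinitely
        rec = msg.value
        # Each row lands in the row store first; Yellowbrick ages it into
        # the columnar store in the background, per the discussion above.
        cur.execute(
            "INSERT INTO payments (payment_id, amount, ts) VALUES (%s, %s, %s)",
            (rec["payment_id"], rec["amount"], rec["ts"]),
        )
```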
[00:25:59] Unknown:
And so in terms of Yellowbrick's capabilities, I'm wondering, as you have worked with it and as you've worked with customers who are deploying it, what are some of the technical aspects of it that you personally find most intriguing, or that you have dug deep into just from a point of personal curiosity?
[00:26:12] Unknown:
We could spend all day talking about some of the things that I've discovered and thought, wow, that's really impressive. But you know what? From a customer perspective, one of the things that stood out for me when I first started looking at Yellowbrick was the fact that there are no customer-facing indexes that you need to define. All the indexing is handled at the back end. There's no management of indexes, and you don't have to do things like gather statistics; all of that's done automatically. And so the level of admin compared to a lot of data warehouses is really low. We do have some things like, when you define schemas, you can define what kind of data distribution you want across a particular table, whether it's a hash-based distribution for a large fact table or a replicated distribution for reference tables.
And we also have notions like partitioning and sorting and clustering. But in most cases, you don't need to bother with those; we don't take particularly strong advantage of them except in particular edge cases. What I'm really getting at is that the DBA admin overhead for Yellowbrick is very, very low, and that translates directly into cost savings for our customers.
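As a rough sketch of those choices, the hypothetical DDL below hash-distributes a large fact table on its common join key and replicates a small reference table to every worker. The DISTRIBUTE clauses follow Yellowbrick's documented Postgres-like dialect, but treat the exact spelling as an assumption to verify against the current docs.

```python
# Hypothetical schema DDL illustrating the distribution options discussed:
# hash-distribute the big fact table on its join key, replicate the small
# dimension table. The DISTRIBUTE syntax is based on Yellowbrick's
# Postgres-like dialect and should be checked against the documentation.
import psycopg2

DDL_STATEMENTS = [
    """
    CREATE TABLE sales_fact (
        sale_id     BIGINT,
        customer_id BIGINT,
        amount      NUMERIC(12, 2),
        sold_at     TIMESTAMP
    ) DISTRIBUTE ON (customer_id)   -- hash on the common join key
    """,
    """
    CREATE TABLE customer_dim (
        customer_id BIGINT,
        name        VARCHAR(200),
        segment     VARCHAR(50)
    ) DISTRIBUTE REPLICATE          -- full copy on every worker node
    """,
]

with psycopg2.connect(host="yb.example.internal", dbname="analytics", user="dba") as conn:
    with conn.cursor() as cur:
        for stmt in DDL_STATEMENTS:
            cur.execute(stmt)
```

Co-locating fact rows with their join key and replicating small dimensions is what lets joins on the distribution key run without shuffling data between nodes.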
[00:27:17] Unknown:
As somebody who is stepping in as the CTO and given the fact that the company is already a few years into their journey, I'm wondering what you have found to be some of the interesting challenges in getting up to speed with the architectural decisions that have been made and understanding how you can bring your own vision and ideas to help the product succeed both technically and in the current database market.
[00:27:43] Unknown:
One of the things that attracted me to Yellowbrick, and what was really exciting, was that they were doing something no other data warehouse vendor was doing. They were coming at the problem from a different angle, solving the problem of shrinking the data path, which is really what they've done at every level if you think about it. And that's really what a huge part of data warehousing is: just shrinking that data path. There was such a strong team there, with a diverse set of backgrounds, looking at this problem. And then when I came in and started talking to customers, I realized that this thing actually does what it says on the tin. You know, it sounded almost too good to be true, and I've even had that response back from customers I've been speaking to while I've been at Yellowbrick. They said, we thought this was too good to be true; when we POC'd it, we realized it really did do what it said, and it really does work. And so that made it very compelling to join. But I think, for me, a big part of my role here is to look at where we need to go next. And for me, it is about things like distributed cloud. It's about making our capabilities available more broadly on the public clouds directly and exploiting these technical advances on the new kinds of instance types there.
But more than anything, it's about putting a compelling user experience around a data warehouse, because data warehouses typically are not known for their user experience. And for the customers I speak to, whether it's the business units and lines of business, the data analysts, or even the DBAs, the idea of doing self-service analytics, of being a citizen data scientist, is really, really important. The idea of, as a developer, making it easy to develop on a platform like this is important. So for me, the user experience around Yellowbrick is something that's really important and where we continue to invest and build out. Yellowbrick Manager is a first exposure to the public of what we're doing. As we move forward over time, you'll see our user experience from end to end just get better and better. And building on the fact that this thing is incredibly fast, you can get sub-second response times for many of your query workloads. So if you're a developer working against this, you can manage that iterative cycle of building reports and getting answers very, very quickly, deploying on your laptop, deploying in the public cloud ultimately, and things like this, which for me is where we will see the main traction of our business grow in the future. I think that's what people in the business are looking for.
[00:30:07] Unknown:
Patrick is a diligent data engineer, probably the best on his team. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data?
Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub-seconds with minimal resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. And to your point of user experience and shortening the iteration loop for developers and data analysts and data scientists, I think it's worth digging into a bit more, particularly because, as you mentioned, you have this fixed-cost approach, so there isn't any concern that if I run this query 15 different ways or 15 different times, I have to think about how much it's actually costing, or whether it's worth the cost of paying for the query to get the answer even if it's not necessarily the best use of the time and energy. I'm just wondering if you can talk through a bit of your experience, or the experience of customers that you've worked with, as far as coming to the realization of being able to unlock more of that experimentation and exploration mentality?
[00:31:52] Unknown:
You hit the nail right on the head there, because they no longer have to stop and pause and think, can I actually afford to do this? Can I actually afford to run this? Instead, Yellowbrick is giving them complete freedom: you can just throw as much as you want at this. We're not gonna charge you for the number of concurrent users, for the number of queries, for the length of the queries, or for the resources taken. When you have the flat subscription that we provide, you can just eat as much of it as you want. And so I think it really does open up the possibilities for developers and data scientists to just do their ad hoc analytics, their data exploration, and to try out new use cases and what-if scenarios in a way that they typically couldn't if they were constrained by a consumption model. We've seen big customers of ours who have really grabbed hold of that opportunity with both hands. One of our largest customers today essentially uses us, with 4,000 registered users on a single system, data scientists and analysts, to do things like anomaly detection over petabytes of data in real time, with interactive queries. They can do that with Yellowbrick, and they couldn't afford to do that with their other solutions at all. So I think we're opening up huge opportunities at scale for these customers at a totally affordable price point.
[00:33:08] Unknown:
As you continue to work with customers, what are some of the capabilities of Yellowbrick that you find are often overlooked or underutilized that you think should be, you know, made more prominent or that people should pay more attention to or try to use in more of their workloads?
[00:33:24] Unknown:
Yeah. And you know what? There's one area, and I think it's not necessarily that it's overlooked or underappreciated: I think the capabilities that we have for streaming ingest, for things like Kafka, aren't used now as much as they will be. And I think that's as much to do with the current business use cases for true real-time analytics, and also with what real time means to lots of different people. For some people, real time can mean, I just want my dashboard updated once a day. Or for others, for example in the retail industry, where you're looking at customer satisfaction or call center operations, you want to try and intercept a customer's call before they make it: you track their customer journey, you decide what set of events is likely to lead to that call, and you intercept it and get ahead of them, where on-the-pulse insights are required. That's where we're seeing some interest. Some of our customers are using that streaming capability. To give you one example, one of our customers is LexisNexis, and they have a product line called ThreatMetrix.
Now, if you've ever purchased anything online that isn't through Amazon, chances are that online payment transaction has gone through the ThreatMetrix online portal, which a lot of retailers sign up to for detecting real-time payment fraud. And that's all back-ended by Yellowbrick at the end of the day. So chances are you may have touched Yellowbrick already without knowing it. But they have very strict SLAs: they wanna be able to detect these kinds of fraudulent activities within a 5-second time frame. They were using things like Impala before, which was getting nowhere near that; they were talking minutes to get an answer out of it. So that's one area where we're seeing some of these true streaming use cases come about. But I think over time, the application of that particular capability is just gonna go through the roof.
[00:35:18] Unknown:
In terms of the actual adoption of Yellowbrick, can you talk through a bit about what's involved in actually getting it set up and deployed and integrated into a customer's infrastructure?
[00:35:28] Unknown:
Yeah. And, hopefully, I've built up a picture now that in many cases, given the Postgres compatibility, there are a lot of legacy data warehouses out there that also based their dialects on PostgreSQL. Think about Redshift: we have had Redshift customers who've migrated to us very straightforwardly because it's a Postgres-based database, same as Vertica, same as Greenplum, same as Netezza. They all have Postgres parsers at the front end, if you like. So lifting and shifting or migrating from those kinds of sources is very, very easy. If you couple that with the fact that all of your ETL tools and all of your business intelligence tools work out of the box, the process of inserting Yellowbrick into that workflow becomes a lot easier. And we can typically take your schema, your DDL.
It just translates over very easily. And in many cases the workloads, which are always the hard part (it's always the applications that are the heavy lift in any migration activity), benefit again from the Postgres compatibility, which really does lower the barrier.
[00:36:33] Unknown:
Just as far as deploying the actual database itself, though, I'm also curious about what's involved in that process of getting it set up, deploying it, and actually managing it within your network environments or within your systems architecture?
[00:36:47] Unknown:
Yeah. So we actually offer Yellowbrick in 2 ways right now. You can consume Yellowbrick on subscription as a service through your public cloud, via AWS or Azure. And what we do there is manage everything at the back end: upgrades, backups, maintenance, all of that kind of thing. Then we set you up with a private link connection to your VPC in AWS, for example, and you just see an endpoint to Yellowbrick. You get a 10-gigabit link with 10 milliseconds of latency to access it, and you can just go ahead and consume it that way. Or we can deploy our specialized instance hardware in your own data center, and you can manage it yourself.
It's a really, really small-footprint system. It typically takes up less than half a rack, something like 12 to 14U, within your data center, which means it's low on power consumption, low on cooling, and doesn't take up much space. We've had instances where we've gone in and replaced 6 or 12 racks of Netezza with half a rack, or even a quarter of a rack, of Yellowbrick for those customers that wanna deploy it themselves. But going forward, we are gonna be launching software-only versions of Yellowbrick that deploy natively in AWS or Azure, and also natively on on-prem Kubernetes stacks like OpenShift, for example. So we're broadening out our deployment options this year and beyond.
[00:38:12] Unknown:
In terms of the ways that you've seen Yellowbrick used or the environments where it's deployed or the specific applications built on top of it, I'm wondering what are some of the interesting or innovative or unexpected ways that you've seen it deployed and used.
[00:38:24] Unknown:
Typically, we are used in business-critical use cases, and this is where consumption models don't work at all. In an always-on environment like that fraud detection example I gave earlier, you're operating 24/7; you can't afford to go down. And so the high availability capabilities we have built into Yellowbrick, and the asynchronous replication between DR sites that we have built in, all of that comes into play. Where we're finding the most interest is in a range of verticals, but financial services, insurance, and telco are 3 of the biggest verticals we're finding ourselves in. We have plenty of examples in retail and health care and beyond as well.
Some of the really interesting use cases we find ourselves in are in the telco space. For example, you probably don't know this, but if you're an AT&T or Sprint phone user, when you roam around the US and move from cell to cell, the cell you're using at any one point in time may not belong to the company your subscription is with. These telcos have roaming agreements with each other, and they reconcile the costs as people roam between their different networks. And so one of our customers, TEOCO, has Yellowbrick running behind the scenes to reconcile all of these different inter-carrier roaming charges every day. We're processing something like 400 to 500 million records a day in Yellowbrick on a single system to reconcile all of this movement, and I thought that was quite fascinating. What I love are the use cases where I don't realize I'm using Yellowbrick, but I am. There are other use cases that stand out for me too, a whole bunch of just incredibly important ones. Another one that's really interesting: one of the largest insurance companies in the world standardized on Yellowbrick for their enterprise data warehouse, and they were doing it for their claims ratio processing and all the stuff that you do within the insurance business. But what was really interesting is that they're also using Yellowbrick for their HR analytics. Just HR, you might think, but it turns out they've got 2 million employees and associates across the world, and so their HR problem starts to blur into a significant data size problem as well.
[00:40:48] Unknown:
And in terms of your own experience of coming to Yellowbrick, taking on the CTO role, and starting to work through building the product, using the product, and helping your customers understand the product, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:41:06] Unknown:
What I think surprises a lot of people is the speed at which you can just load and go with this thing, because you don't have to worry about indexes and things like that. You can throw your data in there against a schema and know you're gonna get pretty blistering performance. One of the reasons for that is, as I mentioned, we can address all of your data on SSD directly from the parallel CPU cores without having to go via main memory. What that means is we effectively act as an in-memory database, but over petabytes of data, and that's really the kind of performance characteristic that you get from it. We meet very, very few challenges here. Usually, as with any data warehouse process, when you're looking at new vendors, the challenge is around migration. But what we've found is that we've been able to get customers up and running on Yellowbrick surprisingly fast. And that's not all about Yellowbrick's technology.
A big part of it is, but we also have strong partnerships with migration partners who have automated tools and the skill sets. They've done this stuff over and over before, to really shrink the time to value of moving off your existing EDW onto Yellowbrick. So I think the challenges out there are mainly about migration; I can't really think of any technical challenges around Yellowbrick. I know for certain it's the easiest data warehouse I've ever come across in my career in terms of usage.
[00:42:29] Unknown:
As you were talking a bit more about the disk access and being able to work across potentially petabytes of data, I'm curious what the scale-out process looks like for Yellowbrick, as far as being able to scale across multiple disks or multiple CPUs, whether that's something where you would scale vertically on a single machine, or if it also has a horizontal scaling story.
[00:42:53] Unknown:
It's horizontally scaled. Yellowbrick systems can start at 3 nodes as a minimum, and on our current platform they scale up to 40 nodes. Each of those nodes has 64-core AMD processors in it and a ton of NVMe SSDs, and you can add them node by node without any interruption in service. We're an MPP scale-out data warehouse, a shared-nothing architecture. When you add another node to your Yellowbrick system, we handle the data redistribution in the background; there's no interruption in your service when you do that, and so it's very, very easy to incrementally scale this thing out. And as I mentioned earlier, you can imagine the anatomy, the lifetime, of a query: a query comes in from your BI tool like Tableau and hits our planner; we create a plan tree, a query execution plan, out of that, which gets compiled down. And this compiled code gets deployed onto the workers on each of the individual nodes, which process that query against their local part of the data.
That even includes complex joins and aggregations. We can do this in a highly concurrent environment with many concurrent users, because we've also got a very sophisticated workload management capability in place that will properly divide up the resources (memory, storage, and CPU) among the users of the system and prioritize different workloads. So, a long answer to your question, but yes: it's a scale-out capability.
[00:44:21] Unknown:
As far as the actual distribution: is it something where you have a planner node with consistent hashing for determining where the actual data sits on disk? Or, since you mentioned it's shared-nothing, is it more that each of the nodes is a read/write primary, where it doesn't matter which instance you hit, you're able to read and write across the system? Or do you have something like an architecture with a NameNode that's responsible for keeping track of where everything is? Or, like with MongoDB, where the mongos process is responsible for determining which instance has the appropriate shard that you're trying to work
[00:44:56] Unknown:
on? Yeah. It's all purely classic hash-based processing. When you're looking up a particular point of data, you're generating a hash key, and that maps to a particular worker that owns that shard of data. It's a very common way of distributing the data: the worker just operates on its piece of it. So if you look at your query execution plan, at the very leaves of the tree are the table scan nodes that access data on a particular worker, and these all feed up to the root of the tree, which is the point at which you get your result set out. Along the way they go through various join, aggregation, sorting, and partitioning stages within the nodes of that graph, because it's actually a graph rather than a tree. That's essentially how it works. Any node operates on its portion of the data, but, obviously, they're all interconnected at the back end. On our hardware instances, that's InfiniBand, 200-gigabit connectivity between each worker.
And so when it needs to redistribute data as part of the query execution plan, it's very, very efficient to do that. It's a classic kinda shared nothing distributed query problem.
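A toy version of that classic hash placement (not Yellowbrick's implementation): hash the distribution-key value, take it modulo the number of workers, and the resulting worker owns the shard the row lives in.

```python
# Toy illustration of classic hash-based data placement in a shared-nothing
# MPP system; not Yellowbrick's actual implementation.
import zlib

NUM_WORKERS = 8

def worker_for(key: str) -> int:
    """Map a distribution-key value to the worker that owns its shard."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS

rows = [("cust-1001", 42.50), ("cust-2002", 13.99), ("cust-1001", 7.25)]
for key, amount in rows:
    print(f"{key} -> worker {worker_for(key)}")

# Equal key values always map to the same worker, so joins and aggregations
# on the distribution key need no data movement between nodes.
```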
[00:46:00] Unknown:
And so for people who are looking to build out a data warehouse system and are considering Yellowbrick, what are the cases where it's the wrong choice and they might be better suited by a traditional data lake architecture, or a different approach to data warehousing where maybe they wanna do something like ClickHouse or some other solution for managing their data? You know, maybe they just put it all in a Postgres node. I think that pretty much does characterize the boundaries
[00:46:27] Unknown:
here where we're the right choice. If you've got a relatively small amount of data, if it's less than a terabyte, if it's a few hundred gigabytes, then there are choices out on the market that would suit you just as well as Yellowbrick at that kind of small scale. It's when you get into anything from the 1 to 10 to 100 terabyte range up to petabyte scale that Yellowbrick really delivers on price and performance. And it is all about delivering 10 to 100 times the performance at about a fifth of the cost of those large enterprise data warehouses. But what I would say is that a lot of our customers aren't necessarily starting at the petabyte scale. They could be adopting Yellowbrick in a departmental use case that's fairly small, and then what we see is the usage of Yellowbrick expand across the customer base. Almost two-thirds of our customers, for example, have expanded their usage of Yellowbrick substantially since they acquired it. They're satisfied, it does what it says on the tin, and they're getting the value out of the analytics and out of the investments they're making. As you continue to plan for the near to medium term of the technical and product direction of Yellowbrick, what are some of the things that you have planned and that you're most excited to work on? We're getting fully behind this distributed cloud vision with our partners. Distributed cloud isn't something that Yellowbrick will wrap up and take to market as a product; what I see is a blueprint, an architectural pattern that will emerge as these public cloud stacks start to appear in different locations.
And with our adoption of Kubernetes and containerization as a way of deploying, managing, and orchestrating Yellowbrick software anywhere, in the public cloud, on premises, at the network edge, that's where I think things get really exciting for us. I think it's gonna set us apart from other vendors in this space, because we will truly be able to say: we can run anywhere. We can satisfy use cases at the network edge at the smallest level, in a streaming data kind of use case, all the way up to the standard, classic, centralized data warehouse at petabyte scale. We're in a position to provision instances and give you a user experience that is uniform across however geographically disparate a deployment topology you have.
I think that's really exciting to me, and I think it'll be really interesting to see how Yellowbrick gets applied to the new IoT use cases that are gonna emerge here, which I think other data warehouse technologies and databases will struggle to address. So for me, it's all about keeping the foot to the floor in terms of performance and efficiency; we're making leaps and bounds in terms of improving the user experience of data warehousing overall, and we're doing that in a way that maintains that price and performance advantage. Those are the 3 things that are driving our road map forward.
[00:49:17] Unknown:
Are there any other aspects of the Yellowbrick project, or the use cases that it's built for, or your experience of helping to drive the technical direction for it, that we didn't discuss yet that you'd like to cover before we close out the show? There are things we could talk about around the built-in replication capabilities that we have, where you can replicate data
[00:49:36] Unknown:
between geographically remote Yellowbrick instances for DR purposes. I'd also love to talk a little bit more about how we do data resilience and reliability. Many data warehouses, in order to maintain high availability on a single instance, replicate data: you keep 3 copies of the data, like you would in Hadoop, or you mirror the data. But we actually use erasure coding to ensure that if we lose nodes within the system, we don't lose data, and we can do that with a fraction of the overhead, in terms of copies of the data, compared to simply doing brain-dead mirroring. I could probably go on all day, but I'll leave it there. If you wanna know more, I'd just say come to our website. We've got a 40-page white paper that goes into all of the technical intricacies of Yellowbrick; we've just laid it all bare for everyone out there. So if you wanna know more information, check that out. Check out the movies and videos on there from our recent summit; they go into the technical details, you'll see demonstrations there, and you'll even see the use of Denodo for federated queries as well. So, yeah, that would be my advice.
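To see why erasure coding beats mirroring on overhead, consider the simplest possible scheme, a single XOR parity chunk: N data chunks survive one loss at 1/N extra storage, versus 200% extra for three-way copies. Yellowbrick's actual coding is more sophisticated; this sketch only demonstrates the principle.

```python
# Simplest-possible erasure code: one XOR parity chunk protects N equal-length
# data chunks against a single loss, at 1/N storage overhead (vs. 2x extra for
# three-way mirroring). Yellowbrick's real scheme is more sophisticated.
from functools import reduce

def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

data = [b"node-A-chunk", b"node-B-chunk", b"node-C-chunk"]  # 12 bytes each
parity = xor_chunks(data)  # stored on a fourth node

# Simulate losing node B, then rebuild its chunk from the survivors + parity.
rebuilt = xor_chunks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt:", rebuilt.decode())  # -> node-B-chunk
```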
[00:50:49] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes, and we'll add links to all the things you mentioned. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:06] Unknown:
I think one of the biggest gaps is still in the upper layers, around data cataloging. I still haven't seen a solution that's adequate, especially when you start to think about distributed clouds, with your data separated across geographic locations. Keeping track of that data is getting harder and harder, and I think there's a gap there.
[00:51:34] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Yellowbrick. Definitely a very interesting project and one that I'm excited to see where you go with. So thank you for all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Pleasure speaking with you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Mark Cusack and Yellowbrick
Overview of Yellowbrick's Data Warehouse
Challenges and Opportunities in Distributed Cloud
Architectural Decisions and Innovations
Evolving Strategies and Market Realities
Data Modeling and Query Optimization
Technical Intricacies and Customer Use Cases
Capabilities and Underutilized Features
Deployment and Integration
Future Directions and Exciting Developments