Summary
The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and it can work across clouds and all the way out to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing have for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Mark Cusack about Yellowbrick, a data warehouse designed for distributed clouds
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Yellowbrick is and some of the story behind it?
- What does the term "distributed cloud" signify and what challenges are associated with it?
- How would you characterize Yellowbrick’s position in the database/DWH market?
- How is Yellowbrick architected?
- How have the goals and design of the platform changed or evolved over time?
- How does Yellowbrick maintain visibility across the different data locations that it is responsible for?
- What capabilities does it offer for being able to join across the disparate "clouds"?
- What are some data modeling strategies that users should consider when designing their deployment of Yellowbrick?
- What are some of the capabilities of Yellowbrick that you find most useful or technically interesting?
- For someone who is adopting Yellowbrick, what is the process for getting it integrated into their data systems?
- What are the most underutilized, overlooked, or misunderstood features of Yellowbrick?
- What are the most interesting, innovative, or unexpected ways that you have seen Yellowbrick used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Yellowbrick?
- When is Yellowbrick the wrong choice?
- What do you have planned for the future of the product?
Contact Info
- @markcusack on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Yellowbrick
- Teradata
- Rainstor
- Distributed Cloud
- Hybrid Cloud
- SwimOS
- Kafka
- Pulsar
- Snowflake
- AWS Redshift
- MPP == Massively Parallel Processing
- Presto
- Trino
- L3 Cache
- NVMe
- Reactive Programming
- Coroutine
- Star Schema
- Denodo
- LexisNexis
- Vertica
- Netezza
- Greenplum
- PostgreSQL
- Clickhouse
- Erasure Coding
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan, that's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Mark Cusack about Yellowbrick, a data warehouse designed
[00:02:04] Unknown:
for distributed clouds. So, Mark, can you start by introducing yourself? Hi there, Tobias. Very nice to be here with you today. Yeah. My name is Mark Cusack. I'm the chief technology officer at Yellowbrick. I've been with Yellowbrick for around about the last 7 months. And before joining as CTO, I was an exec at Teradata, where I led the product management team responsible for the data warehouse software and the advanced analytics and machine learning portfolio there. I was at Teradata for around 6 years, and at one point I pretty much owned all of the product portfolio at Teradata. But I came to Teradata by way of an acquisition they made in 2014 of a startup called Rainstor, which was in the data warehouse archiving market. We built that up from scratch over about 10 years. So I've been in amongst databases and data warehousing for a big chunk of my career. And do you remember how you first got involved in the area of data management?
I remember back to the dim distant past. This actually came out of some work at the UK Ministry of Defence, where I was a research scientist. We got grants to research various aspects of how technology could help military training and development. And one of the interesting use cases they had was around how you manage the data collected from large-scale field exercises. You can imagine fleets of tanks roaming across the plains; they would stop these training exercises, and they'd have a whole bunch of data gathered from all of these military systems, and they wanted to do what-if analysis in the field. And as you can imagine, back in the early 2000s, the technology that we had for storing, processing, and running analytics was pretty primitive, and the hardware footprint available in the field was small. And so we had the bright idea of coming up with a compressed in-memory database technology that we could actually deploy in the field.
And so it turned out that this technology was pretty useful in other applications, and we ended up spinning it out to form Rainstor, the startup company that grew into this data warehouse and archiving technology. So it was a spin-out of some military IP
[00:04:09] Unknown:
into the commercial world, and a pretty successful one. As you mentioned, you recently joined the team at Yellowbrick. So I'm wondering if you can start by giving a bit of an overview of what Yellowbrick is, some of the story behind it, and your motivation for joining the team there.
[00:04:23] Unknown:
So Yellowbrick is a modern data warehouse for the hybrid and distributed clouds, and we can break down what we really mean by that into its constituent parts. By modern data warehouse, we mean we've built a data warehouse, a database technology, from the ground up using modern software development techniques and with a thorough understanding of the hardware platform that we're working on as well. And, hopefully, we'll get an opportunity to talk in more detail about that later. But the Yellowbrick story is really interesting: Neil Carson, the CEO, Jim Dawson, and Mark Brinicombe were the founders, and they were at Fusion-io, a flash storage company.
Back in around the 2010 to 2013 period, SSDs were beginning to become really interesting from a data warehousing perspective, and traditional companies like Teradata or Oracle were thinking, well, how can we replace our spinning disks with SSDs? Will we get around some of the performance and bandwidth bottlenecks of those technologies, which are traditionally where the bottlenecks in data warehousing analytics reside? And so they went to companies like Fusion-io and said, well, can we take those SSDs, shove them in our data warehouse platforms, and get a benefit from that? When these companies did that, they found they didn't get a benefit. What they'd done was move the bottleneck elsewhere, up into DRAM, up into main memory.
And so Neil and the team had an idea: what if we could take some of the ideas of how we route data from SSDs directly into the CPUs and bypass main memory? Maybe we could come up with a database technology that would cut out that middleman, cut out that bottleneck, and that's exactly what they did. Yellowbrick was founded in 2014 and came out of stealth mode in 2017. We've acquired quite an array of enterprise customers now, we've raised $173 million in venture capital, and we're having a fantastic year.
[00:06:19] Unknown:
In terms of the actual distributed cloud terminology, you mentioned hybrid cloud and distributed cloud in somewhat the same breath. I'm wondering if you can talk a bit about the challenges that are associated with working in a distributed cloud environment, what that really means in terms of the actual deployment topology, and some of the types of environments where that kind of requirement will come up. Yeah. Well, I think Gartner actually coined the phrase distributed cloud, and they also have a rather pithy way of saying that distributed cloud fixes what hybrid cloud breaks.
[00:06:53] Unknown:
And just to define what a distributed cloud is: it's the idea of taking the public cloud hardware and software stacks in AWS and Azure and Google Cloud, for example, but locating them elsewhere. So in private clouds, in customers' own data centers, or even at the network edge. And so now you have a common, unified set of hardware and software and APIs on which to deploy higher-level services. Right? Another defining feature of distributed cloud is the idea that you have a single control plane across all of these different locations where you're deploying cloud services, to control all these things. So now you've got a homogeneous foundation, a single pane of glass for control, and you're able to deploy data and analytics solutions at the point of need, which is increasingly where the data is being generated.
And that could be at the network edge, it could be public data centers, etcetera. So that's really what distributed cloud is. And rather than challenges, it actually opens up a whole bunch of opportunities in this area. When you consider Gartner saying that within a couple of years, 50% of all data is gonna be generated outside of a corporate data center, and by 2025 it's gonna be 90%, the idea of pulling tons of data back from the edge to process within a public cloud isn't gonna work. We're not gonna be able to backhaul that amount of data. We're gonna need to push the processing out towards the edge, process it there, filter it, combine it, aggregate it there, and only send back what we need. So this is where interesting topologies start to arise. You can imagine databases being deployed close to the field data centers associated with 5G telecom applications, where you might have 100 antennas connected to a single field data center, with an analytics stack deployed there on something like Kubernetes to do localized filtering and low-latency connections with connected cars, while the center collects that history and does classical historical processing.
[00:08:57] Unknown:
And as you're talking about pushing the actual database engine into these edge locations, it makes me think about some of the current architectural, quote, unquote, best practices of using something like Kafka or Pulsar to stream those messages into a centralized location, so that you have this piece of, quote, unquote, big iron, like in the rack and server days, but as these cloud data warehouses where all your processing happens in one location. Or having something like SwimOS, where you do some edge processing on streaming data and then maybe send some of the filtered data back to the central location for the big processing. I'm wondering what architectural patterns come about with something like Yellowbrick, where you have the native capability of running across these different platforms, versus what has come to be known as the best way to do things given the set of technologies that people have settled on? Just to address the first part of that, I think one of the things that databases haven't got right up until now, and this is, I think, a key thing that Yellowbrick has solved, is the idea of how to properly handle
[00:10:13] Unknown:
streaming data and near-real-time analytics and low-latency access to that data. Because, typically, data warehouses are all about batch processing. And even if you look at the new breed of cloud data warehouses, the so-called next-generation data warehouses like Snowflake or even Redshift, they don't handle streaming data. We made some particular architectural and implementation decisions to make us equally good at streaming single-record inserts into Yellowbrick as we are at doing batch high-volume loads.
But if we take a step back and think about how Yellowbrick is architected, we're built from the ground up to be an MPP scale-out data warehouse. So we can scale down to the very smallest use case and up to multi-petabyte use cases, and we've got customers operating at those kinds of scales today. Our key value proposition is our industry-leading price and performance, and so we've optimized the software and the hardware instances we run on to really maximize the performance and minimize the price at every point. And that's what sets us apart, along with being able to handle streaming data in a way that the other data warehouses can't, and being able to deploy across a range of different on-prem or cloud options as well. Talking a bit more about some of the use cases for this distributed cloud architecture,
[00:11:37] Unknown:
maybe we can follow on a bit more about some of the existing patterns that people have started to use that they could replace with Yellowbrick, or some of the ways that people have traditionally thought about other types of database architectures where Yellowbrick might be feasible given the way that it's designed?
[00:11:55] Unknown:
Yeah. And to be very clear, Yellowbrick is a relational SQL data warehouse, fully focused on solving the data warehousing problems that both traditional vendors and the new breed of data warehouse vendors are going after. We typically find ourselves being deployed in 3 technical use cases. First of all, we're going in and replacing legacy data warehouses like Oracle and Teradata and SQL Server that are proving too expensive, too difficult to administer, and too difficult to expand. So we're going in there and doing a straight replacement, or new workloads come to Yellowbrick. That's one use case, and it works because, typically, we look and smell like Postgres from the outside. But underneath the hood, we are very, very different. What that does mean is that the ecosystem of BI tools and data integration tools that a lot of enterprises have typically just integrates very seamlessly when they're thinking about moving to a modern data warehouse.
The second set of technical use cases we typically go into is around data lake augmentation. This is the situation where people are using data lakes like Databricks, or Hadoop and Parquet combinations, or Presto, and they're trying to use them as data warehouses. And, funnily enough, they're not getting the performance or the price that they need for their particular use cases out of that. Data lakes have been great in terms of the flexibility and the different kinds of analytics you can bring. But if you wanna do highly structured relational processing against that data, then the best place for it is in a warehouse designed for that shape of data. And last but not least, we're getting a lot of traction with the wave of customers that have jumped into the cloud, adopted cloud data warehousing solutions, and then realized the consumption model that they employ is too unpredictable in terms of spend, so they can't predict what their budgets are gonna be for the next year. We have a very, very simple flat-rate subscription with Yellowbrick, and so your spend is very predictable.
It's lower, and a lot of enterprises are really valuing that. So these are kind of the 3 areas that typically people bring us in for.
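To make that Postgres compatibility concrete, here is a minimal sketch of connecting to a Yellowbrick endpoint with a stock Postgres driver. The host, database, credentials, and table are hypothetical placeholders; the point is simply that nothing Yellowbrick-specific is required on the client side.

```python
# Minimal sketch: Yellowbrick speaks the Postgres wire protocol, so a stock
# driver such as psycopg2 connects with no Yellowbrick-specific client code.
# Host, database, user, and table names below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="yb.example.internal",  # your Yellowbrick endpoint
    port=5432,
    dbname="analytics",
    user="report_user",
    password="change-me",
)

with conn.cursor() as cur:
    # Ordinary ANSI SQL, exactly as you would issue against Postgres.
    cur.execute(
        """
        SELECT region, SUM(amount) AS total_sales
        FROM sales
        GROUP BY region
        ORDER BY total_sales DESC
        """
    )
    for region, total_sales in cur.fetchall():
        print(region, total_sales)

conn.close()
```

This same property is what lets Tableau, Power BI, Informatica, and the rest of the Postgres-speaking ecosystem work unmodified, as Mark describes later in the conversation.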
[00:14:04] Unknown:
And in terms of the kind of main focus of the product, the overall use cases that it's aimed at, and the way that it has been designed as a result, I'm curious how those goals and designs have shifted since the product was first conceived and as it starts to hit the realities of the market and the realities of the use cases?
[00:14:24] Unknown:
Good question. The strategy has evolved, but it's worth quickly illustrating what the strategy is and the approach we've taken architecturally from the start, because it remains absolutely core to what we do going forward. As I mentioned, back in the day when data warehousing companies were trying out SSDs to see if they could speed up their processing and realized they couldn't, the way that Yellowbrick was designed took very, very full advantage of technologies like NVMe SSDs and the ability to route data directly from SSDs into the L3 cache on CPUs.
So that cuts out random access memory; it cuts out the main memory. Typically, databases use main memory for a buffer cache, the idea being that it's obviously faster to access data from memory into the CPU than from disk directly to the CPU. At least, that's the traditional wisdom. But with NVMe, that simply is not the case, and you can get the same bandwidth out of SSDs via NVMe as you can get out of main memory. So what we do, essentially, when we run a query, is directly address the data on the SSDs from the CPU. That data gets loaded into the L3 cache, and then our entire software layer takes over. Now, it's not just the inefficiencies of main memory that we bypass; we've also looked up the software stack as well. And if you look at pretty much every software vendor out there in the warehousing space, they're all building on a stock Linux kernel.
So we start off with Linux, but we actually run in what's called a user-space bypass kernel. In other words, it's a Linux process that basically, on startup, says: Linux, get out of the way, Yellowbrick's taking over. We're gonna manage the memory, the threading and scheduling, the storage stack, the network stack, and which device drivers we use. And Linux is relegated to just certain monitoring and logging capabilities. Because if you do a deep analysis, the network stack in Linux can be made 20 times more efficient than the standard networking stack that's in there today. The storage and IO, if you have more creative alternatives, can be 100 times faster and more efficient.
And then you build on top of that new programming paradigms around reactive programming, coroutines, and things like this. What it means is that all the way up the software stack, in the execution of a query, we are in charge of making sure that memory doesn't get fragmented. We minimize context switching, and we keep those L3 caches on the money as much as possible and avoid cache misses at every possible point. We've done a huge amount of work to make sure that when you run a query, everything you need to satisfy that query is, as much as possible, kept in that local L3 cache for the best performance, and that's what gives us 5, 10, 100 times improvements in performance compared to other data warehouses. That theme has stayed with us all the way through. But what has changed is that, up until now, we've been providing our own specialized hardware to go with the software database.
But now we're seeing the cloud vendors and their hardware instances catch up. Today you get AWS or Azure instances with NVMe SSDs in them, and we can take advantage of that out of the box. And so all of the performance improvements that we've developed over the years now apply in the cloud domain too.
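Yellowbrick's kernel-bypass internals obviously can't be reproduced in a few lines, but the cache-locality principle behind them is easy to demonstrate. This toy Python/NumPy timing (an illustration, not Yellowbrick code) sums the same matrix twice; the traversal that touches memory contiguously keeps the CPU caches hot and runs markedly faster than the strided one, even though both do identical arithmetic.

```python
# Toy illustration (not Yellowbrick code) of why cache locality matters:
# identical arithmetic, very different memory access patterns.
import time
import numpy as np

n = 4000
m = np.ones((n, n), dtype=np.float64)  # C-contiguous: rows are adjacent in memory

start = time.perf_counter()
row_total = sum(m[i, :].sum() for i in range(n))  # contiguous, prefetch-friendly
t_rows = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(m[:, j].sum() for j in range(n))  # strided, cache-hostile
t_cols = time.perf_counter() - start

assert row_total == col_total  # same answer, very different cost
print(f"row-wise: {t_rows:.3f}s  column-wise: {t_cols:.3f}s")
```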
[00:17:49] Unknown:
And then in terms of the actual software stack, as you're mentioning, you're able to gain better performance because you're optimizing for a specific workload, whereas some of the anecdotes about what's still in the Linux kernel, like being able to run a floppy disk, suggest it's just wasting space and CPU cycles. I'm wondering, particularly as you're talking about the distributed cloud, what are some of the optimizations that you've built in, in terms of the communication capabilities, but also in terms of the management interface, for being able to gain visibility of the data assets that you have across those different cloud environments that you're working with?
[00:18:28] Unknown:
We just announced the release of a new component of our portfolio called Yellowbrick Manager. And one of the ideas of distributed cloud, as I mentioned earlier, was that single pane of glass, a single control point on which you manage your data and analytics services. That's the ultimate vision of distributed cloud, which, of course, goes beyond just data warehousing; we like to position ourselves as a data warehouse for such an environment. Yellowbrick Manager is our single unified control plane. And so within Yellowbrick Manager, the vision is that you will be able to provision instances of the Yellowbrick database at the edge, in public cloud, and within private cloud data centers, and administer and access them all from within this single pane of glass, essentially.
That's kind of our first foray into distributed cloud. But one thing that really enables us to do this is our adoption, going forward, of containerization and Kubernetes as our run-anywhere deployment platform of choice. Because once we have containerized software, all of this OS kernel work that I mentioned and all the stack of software above it, we can deploy on a 5-year-old laptop all the way up to, you know, the most specialized hardware instances
[00:19:42] Unknown:
available in the public cloud. As far as the actual use and management of the database, are there any considerations that go into data modeling that are specific to the way that you might use Yellowbrick, versus the traditional star or snowflake schema or the current trend towards wide tables, particularly as you're working across these different cloud environments, for being able to optimize data locality or to potentially join across datasets in these different locations?
[00:20:13] Unknown:
Yellowbrick is really aimed at those orthodox data warehousing use cases, first of all. We are fully relational, with ACID semantics in place, so inserts, updates, deletes, and transactions are all fully supported within Yellowbrick. The notion of primary key and foreign key relationships is in there as well. And so the idea is that if you've got a data warehousing schema in another data warehouse, it's a very simple exercise to replicate it within Yellowbrick. You don't have to do any crazy denormalization to get performance out of it. We can do joins at massive scale at very, very high performance.
That all works out of the box. We have examples where customers have consolidated their data out of a bunch of, for example, SQL Server instances, lifted and shifted it into Yellowbrick, and seen 150x performance improvements out of the box. There are Teradata use cases where, out of 250 queries migrated, we beat Teradata on 248 of them out of the box, and they all just ran. So, you know, lots of examples. What we try to do is make it as easy as possible to translate what you're doing today into what you can do tomorrow with Yellowbrick. And as I mentioned earlier, with the PostgreSQL dialect that we support, together with extensions for certain aspects of Oracle and other dialects,
it becomes even easier. You don't need to change any of your tools: your Tableau and your Power BI still work, and so do your Informatica and DataStage integration tools. Actually, I didn't answer the second part of your question, which just came back to me, around federated queries. First of all, the answer is that we do this today through partners. We have a good partner called Denodo, which you may have heard of, which provides the ability to do query federation across a range of sources, one of those sources being Yellowbrick. So we're not working on the idea of query virtualization or data fabrics ourselves.
But what is really interesting, actually, is when you think about distributed clouds and where the data is gonna be located, whether it's because it's being generated there, or because for data sovereignty or gravity or privacy reasons it's gotta remain in a particular data center or country, you need to minimize the movement of data, and you wanna be pushing the analytics to the data rather than pulling the data back. It's just too inefficient; too much time is spent moving data around before you get any value from it. So distributed cloud, in my view, is all about squeezing out and achieving value from where the data lies. And I think when the volumes start to get really large, the idea of federating queries across locations will start to get expensive as well. That's an element that's gonna be interesting to see play out. Yeah. It definitely pays to think about the questions that you're trying to answer, and to get a fragment of the answer from each of the different locations and then join those fragments, rather than trying to join everything to build the entire answer at once. Well, because you start to get hit by data egress fees from public clouds. So you gotta push down a lot of processing, and you gotta do compression over wide area networks, as well as encryption. There are a lot of technical things you need to think about to make it cost effective, essentially.
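To make the "join the fragments" idea concrete, here is a minimal sketch of aggregation pushdown under assumed endpoints and schema: each location computes a partial COUNT and SUM close to its data, and only those two numbers cross the WAN per site. Note that a global mean has to be rebuilt from sums and counts; averaging per-site averages would be wrong when sites hold different volumes.

```python
# Minimal sketch of pushing analytics to the data: per-site partial aggregates
# travel over the WAN instead of raw rows. Endpoints and schema are hypothetical.
import psycopg2

EDGE_ENDPOINTS = ["edge-us.example.com", "edge-eu.example.com", "edge-apac.example.com"]

PARTIAL_SQL = """
    SELECT COUNT(*) AS n, COALESCE(SUM(latency_ms), 0) AS total_latency
    FROM requests
    WHERE ts >= now() - interval '1 hour'
"""

grand_n, grand_total = 0, 0.0
for host in EDGE_ENDPOINTS:
    with psycopg2.connect(host=host, dbname="telemetry", user="reader") as conn:
        with conn.cursor() as cur:
            cur.execute(PARTIAL_SQL)
            n, total = cur.fetchone()
            grand_n += n                  # only two numbers cross the WAN per site
            grand_total += float(total)

print(f"global mean latency: {grand_total / grand_n:.2f} ms over {grand_n} requests")
```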
[00:23:27] Unknown:
And in terms of the query planner in Yellowbrick, is that something that you've started to think about, as far as being able to offer some intelligence for the case of: I'm trying to answer this question on this dataset, but I have copies of this data across multiple locations; being able to push that down, get the intermediate responses, and then join them back at the point of query?
[00:23:47] Unknown:
No. I mean, we're not looking at that. I'd say that's really in the realm of things like Denodo. We have a very, very sophisticated cost-based optimizer within Yellowbrick, which can turn the most horrendous SQL queries into compiled plans that are rather quite beautiful. So we're focusing on making Yellowbrick and its parser, planner, compiler, and all those pieces the most efficient and fastest anywhere.
[00:24:12] Unknown:
Going back to your point earlier about the use of streaming data and optimizing for single-row inserts, as opposed to forcing people to batch up their inserts and updates, I'm wondering if you can talk through a bit of how the actual on-disk representation is designed to allow for those single-row inserts without taking a performance penalty, or without having to,
[00:24:36] Unknown:
you know, do large data reshuffles as you get to a certain tipping point of those individual inserts or updates? The way we've tackled that, again, comes from having the opportunity to start 6 or 7 years ago from a blank piece of paper, knowing what near-real-time streaming use cases were emerging. What the team put together is Yellowbrick as a hybrid row and columnar store data warehouse. You can think of data being stored in 2 ways: streaming data comes in row by row, record by record, into our row store component, and over time it's aged out of there and compressed into our columnar, MPP scale-out back end. That's how we handle it. Obviously, columnar databases are traditionally terrible at doing inserts.
So we do individual record inserts via the row store, and then those automatically get moved out, as I mentioned. But as far as a query is concerned, we unify that entire picture, and so we're transactionally consistent between our row store and our column store. When we reach a certain volume within the row store, records just automatically get moved out. That's how we reconcile the need for high-performance bulk loading into the columnar store at the back end while at the same time supporting out-of-the-box subscription to a Kafka topic, streaming record by record into Yellowbrick.
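Yellowbrick's Kafka subscription is built in, so the sketch below is only an outside approximation of the pattern Mark describes: a generic consumer doing single-record inserts through the Postgres-compatible front end, with the row store absorbing the per-record writes. Topic, table, and connection details are all hypothetical.

```python
# Rough sketch of record-by-record streaming ingest. Yellowbrick has its own
# out-of-the-box Kafka integration; this generic consumer only approximates
# the pattern via the Postgres front end. All names here are hypothetical.
import json

import psycopg2
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "payments",                                      # hypothetical topic
    bootstrap_servers="kafka.example.internal:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

conn = psycopg2.connect(host="yb.example.internal", dbname="fraud", user="ingest")
conn.autocommit = True  # one transaction per record keeps latency low

with conn.cursor() as cur:
    for msg in consumer:  # blocks, consuming the stream indefinitely
        rec = msg.value
        # Each row lands in the row store first; Yellowbrick ages it into
        # the columnar store in the background, per the discussion above.
        cur.execute(
            "INSERT INTO payments (payment_id, amount, ts) VALUES (%s, %s, %s)",
            (rec["payment_id"], rec["amount"], rec["ts"]),
        )
```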
[00:25:59] Unknown:
And so in terms of Yellowbrick's capabilities, I'm wondering, as you have worked with it and as you've worked with customers who are deploying it, what are some of the technical aspects of it that you personally find most intriguing, or that you have dug deep into just from a point of personal curiosity?
[00:26:12] Unknown:
We could spend all day talking about some of the things that I've discovered and thought, wow, that's really impressive. But you know what? From a customer perspective, one of the things that stood out for me when I first started looking at Yellowbrick was the fact that there are no customer-facing indexes that you need to define. All the indexing is handled at the back end. There's no management of indexes, and you don't have to do things like gather statistics; all of that's done automatically. And so the level of admin compared to a lot of data warehouses is really low. We do have some things like, when you define schemas, you can define what kind of data distribution you want across a particular table, whether it's a hash-based distribution for a large fact table or a replicated distribution for reference tables.
And we also have notions like partitioning and sorting and clustering. But in most cases, you don't need to bother with those; we don't take particularly strong advantage of them except in particular edge cases. What I'm really getting at is that the DBA admin overhead for Yellowbrick is very, very low, and that translates directly into cost savings for our customers.
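As a rough sketch of those choices, the hypothetical DDL below hash-distributes a large fact table on its common join key and replicates a small reference table to every worker. The DISTRIBUTE clauses follow Yellowbrick's documented Postgres-like dialect, but treat the exact spelling as an assumption to verify against the current docs.

```python
# Hypothetical schema DDL illustrating the distribution options discussed:
# hash-distribute the big fact table on its join key, replicate the small
# dimension table. The DISTRIBUTE syntax is based on Yellowbrick's
# Postgres-like dialect and should be checked against the documentation.
import psycopg2

DDL_STATEMENTS = [
    """
    CREATE TABLE sales_fact (
        sale_id     BIGINT,
        customer_id BIGINT,
        amount      NUMERIC(12, 2),
        sold_at     TIMESTAMP
    ) DISTRIBUTE ON (customer_id)   -- hash on the common join key
    """,
    """
    CREATE TABLE customer_dim (
        customer_id BIGINT,
        name        VARCHAR(200),
        segment     VARCHAR(50)
    ) DISTRIBUTE REPLICATE          -- full copy on every worker node
    """,
]

with psycopg2.connect(host="yb.example.internal", dbname="analytics", user="dba") as conn:
    with conn.cursor() as cur:
        for stmt in DDL_STATEMENTS:
            cur.execute(stmt)
```

Co-locating fact rows with their join key and replicating small dimensions is what lets joins on the distribution key run without shuffling data between nodes.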
[00:27:17] Unknown:
As somebody who is stepping in as the CTO and given the fact that the company is already a few years into their journey, I'm wondering what you have found to be some of the interesting challenges in getting up to speed with the architectural decisions that have been made and understanding how you can bring your own vision and ideas to help the product succeed both technically and in the current database market.
[00:27:43] Unknown:
One of the things that attracted me to Yellowbrick, and what was really exciting, was that they were doing something no other data warehouse vendor was doing. They were coming at the problem from a different angle, solving the problem of shrinking the data path, which is really what they've done at every level if you think about it. And that's really what a huge part of data warehousing is: just shrinking that data path. There was such a strong team there, with a diverse set of backgrounds, looking at this problem. And then when I came in and started talking to customers, I realized that this thing actually does what it says on the tin. You know, it sounded almost too good to be true, and I've even had that response back from customers I've been speaking to while I've been at Yellowbrick. They said, we thought this was too good to be true; when we POC'd it, we realized it really did do what it said, and it really does work. And so that made it very compelling to join. But I think, for me, a big part of my role here is to look at where we need to go next. And for me, it is about things like distributed cloud. It's about making our capabilities available more broadly on the public clouds directly and exploiting these technical advances on the new kinds of instance types there.
But more than anything, it's about putting a compelling user experience around a data warehouse, because data warehouses typically are not known for their user experience. And for the customers I speak to, whether it's the business units and lines of business, the data analysts, or even the DBAs, the idea of doing self-service analytics, of being a citizen data scientist, is really, really important. The idea of, as a developer, making it easy to develop on a platform like this is important. So for me, the user experience around Yellowbrick is something that's really important and where we continue to invest and build out. Yellowbrick Manager is a first exposure to the public of what we're doing. As we move forward over time, you'll see our user experience from end to end just get better and better. And building on the fact that this thing is incredibly fast, you can get sub-second response times for many of your query workloads. So if you're a developer working against this, you can manage that iterative cycle of building reports and getting answers very, very quickly, deploying on your laptop, deploying in the public cloud ultimately, and things like this, which for me is where we will see the main traction of our business grow in the future. I think that's what people in the business are looking for.
[00:30:07] Unknown:
Patrick is a diligent data engineer, probably the best on his team. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data?
Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub-seconds with minimal resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. And to your point of user experience and shortening the iteration loop for developers and data analysts and data scientists, I think it's worth digging into a bit more, particularly because, as you mentioned, you have this fixed-cost approach, so there isn't any concern that if I run this query 15 different ways or 15 different times, I have to think about how much it's actually costing, or whether it's worth the cost of paying for the query to get the answer even if it's not necessarily the best use of the time and energy. I'm just wondering if you can talk through a bit of your experience, or the experience of customers that you've worked with, as far as coming to the realization of being able to unlock more of that experimentation and exploration mentality?
[00:31:52] Unknown:
You hit the nail right on the head there, because they no longer have to stop and pause and think, can I actually afford to do this? Can I actually afford to run this? Instead, Yellowbrick is giving them complete freedom: you can just throw as much as you want at this. We're not gonna charge you for the number of concurrent users, for the number of queries, for the length of the queries, or for the resources taken. When you have the flat subscription that we provide, you can just eat as much of it as you want. And so I think it really does open up the possibilities for developers and data scientists to just do their ad hoc analytics, their data exploration, and to try out new use cases and what-if scenarios in a way that they typically couldn't if they were constrained by a consumption model. We've seen big customers of ours who have really grabbed hold of that opportunity with both hands. One of our largest customers today essentially uses us, with 4,000 registered users on a single system, data scientists and analysts, to do things like anomaly detection over petabytes of data in real time, with interactive queries. They can do that with Yellowbrick, and they couldn't afford to do that with their other solutions at all. So I think we're opening up huge opportunities at scale for these customers at a totally affordable price point.
[00:33:08] Unknown:
As you continue to work with customers, what are some of the capabilities of Yellowbrick that you find are often overlooked or underutilized that you think should be, you know, made more prominent or that people should pay more attention to or try to use in more of their workloads?
[00:33:24] Unknown:
Yeah. And you know what? There's one area, and I think it's not necessarily that it's overlooked or underappreciated: I think the capabilities that we have for streaming ingest, for things like Kafka, aren't used now as much as they will be. And I think that's as much to do with the current business use cases for true real-time analytics, and also with what real time means to lots of different people. For some people, real time can mean, I just want my dashboard updated once a day. Or for others, for example in the retail industry, where you're looking at customer satisfaction or call center operations, you want to try and intercept a customer's call before they make it: you track their customer journey, you decide what set of events is likely to lead to that call, and you intercept it and get ahead of them, where on-the-pulse insights are required. That's where we're seeing some interest. Some of our customers are using that streaming capability. To give you one example, one of our customers is LexisNexis, and they have a product line called ThreatMetrix.
Now, if you've ever purchased anything online that isn't through Amazon, chances are that online payment transaction has gone through the ThreatMetrix online portal, which a lot of retailers sign up to for detecting real-time payment fraud. And that's all back-ended by Yellowbrick at the end of the day. So chances are you may have touched Yellowbrick already without knowing it. But they have very strict SLAs: they wanna be able to detect these kinds of fraudulent activities within a 5-second time frame. They were using things like Impala before, which was getting nowhere near that; they were talking minutes to get an answer out of it. So that's one area where we're seeing some of these true streaming use cases come about. But I think over time, the application of that particular capability is just gonna go through the roof.
[00:35:18] Unknown:
In terms of the actual adoption of Yellowbrick, can you talk through a bit about what's involved in actually getting it set up and deployed and integrated into a customer's infrastructure?
[00:35:28] Unknown:
Yeah. And, hopefully, I've built up a picture now that in many cases, given the Postgres compatibility, there are a lot of legacy data warehouses out there that also based their dialects on PostgreSQL. Think about Redshift: we have had Redshift customers who've migrated to us very straightforwardly because it's a Postgres-based database, same as Vertica, same as Greenplum, same as Netezza. They all have Postgres parsers at the front end, if you like. So lifting and shifting or migrating from those kinds of sources is very, very easy. If you couple that with the fact that all of your ETL tools and all of your business intelligence tools work out of the box, the process of inserting Yellowbrick into that workflow becomes a lot easier. And we can typically take your schema, your DDL.
It just translates over very easily. And in many cases the workloads, which are always the hard part (it's always the applications that are the heavy lift in any migration activity), benefit again from the Postgres compatibility, which really does lower the barrier.
[00:36:33] Unknown:
Just as far as deploying the actual database itself, though, I'm also curious about what's involved in that process of getting it set up, deploying it, and actually managing it within your network environments or within your systems architecture?
[00:36:47] Unknown:
Yeah. So we actually offer Yellowbrick in 2 ways right now. You can consume Yellowbrick on subscription as a service through your public cloud, via AWS or Azure. And what we do there is manage everything at the back end: upgrades, backups, maintenance, all of that kind of thing. Then we set you up with a private link connection to your VPC in AWS, for example, and you just see an endpoint to Yellowbrick. You get a 10-gigabit link with 10 milliseconds of latency to access it, and you can just go ahead and consume it that way. Or we can deploy our specialized instance hardware in your own data center, and you can manage it yourself.
It's a really, really small-footprint system. It typically takes up less than half a rack, something like 12 to 14U, within your data center, which means it's low on power consumption, low on cooling, and doesn't take up much space. We've had instances where we've gone in and replaced 6 or 12 racks of Netezza with half a rack, or even a quarter of a rack, of Yellowbrick for those customers that wanna deploy it themselves. But going forward, we are gonna be launching software-only versions of Yellowbrick that deploy natively in AWS or Azure, and also natively on on-prem Kubernetes stacks like OpenShift, for example. So we're broadening out our deployment options this year and beyond.
[00:38:12] Unknown:
In terms of the ways that you've seen Yellowbrick used or the environments where it's deployed or the specific applications built on top of it, I'm wondering what are some of the interesting or innovative or unexpected ways that you've seen it deployed and used.
[00:38:24] Unknown:
Typically, we are used in business-critical use cases, and this is where consumption models don't work at all. In an always-on environment like that fraud detection example I gave earlier, you're operating 24/7; you can't afford to go down. And so the high availability capabilities we have built into Yellowbrick, and the asynchronous replication between DR sites that we have built in, all of that comes into play. Where we're finding the most interest is in a range of verticals, but financial services, insurance, and telco are 3 of the biggest verticals we're finding ourselves in. We have plenty of examples in retail and health care and beyond as well.
Some of the really interesting use cases we find ourselves in are in the telco space. For example, you probably don't know this, but if you're an AT&T or Sprint phone user, when you roam around the US and move from cell to cell, the cell you're using at any one point in time may not belong to the company your subscription is with. These telcos have roaming agreements with each other, and they reconcile the costs as people roam between their different networks. And so one of our customers, TEOCO, has Yellowbrick running behind the scenes to reconcile all of these different inter-carrier roaming charges every day. We're processing something like 400 to 500 million records a day in Yellowbrick on a single system to reconcile all of this movement, and I thought that was quite fascinating. What I love are the use cases where I don't realize I'm using Yellowbrick, but I am. There are other use cases that stand out for me too, a whole bunch of just incredibly important ones. Another one that's really interesting: one of the largest insurance companies in the world standardized on Yellowbrick for their enterprise data warehouse, and they were doing it for their claims ratio processing and all the stuff that you do within the insurance business. But what was really interesting is that they're also using Yellowbrick for their HR analytics. Just HR, you might think, but it turns out they've got 2 million employees and associates across the world, and so their HR problem starts to blur into a significant data size problem as well.
[00:40:48] Unknown:
And in terms of your own experience of coming to Yellowbrick, taking on the CTO role, and starting to work through building the product, using the product, and helping your customers understand the product, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:41:06] Unknown:
What I think surprises a lot of people is the speed at which you can just load and go with this thing, because you don't have to worry about indexes and things like that. You can throw your data in there against a schema and know you're gonna get pretty blistering performance. One of the reasons for that is, as I mentioned, we can address all of your data on SSD directly from the parallel CPU cores without having to go via main memory. What that means is we effectively act as an in-memory database, but over petabytes of data, and that's really the kind of performance characteristic that you get from it. We meet very, very few challenges here. Usually, as with any data warehouse process, when you're looking at new vendors, the challenge is around migration. But what we've found is that we've been able to get customers up and running on Yellowbrick surprisingly fast. And that's not all about Yellowbrick's technology.
A big part of it is, but we also have strong partnerships with migration partners who have automated tools and the skill sets. They've done this stuff over and over before, to really shrink the time to value of moving off your existing EDW onto Yellowbrick. So I think the challenges out there are mainly about migration; I can't really think of any technical challenges around Yellowbrick. I know for certain it's the easiest data warehouse I've ever come across in my career in terms of usage.
[00:42:29] Unknown:
As you were talking a bit more about the disk access and being able to work across potentially petabytes of data, I'm curious what the scale-out process looks like for Yellowbrick, as far as being able to scale across multiple disks or multiple CPUs, whether that's something where you would scale vertically on a single machine, or if it also has a horizontal scaling story.
[00:42:53] Unknown:
It's horizontally scaled. Yellowbrick systems can start at 3 nodes as a minimum, and on our current platform they scale up to 40 nodes. Each of those nodes has 64-core AMD processors in it and a ton of NVMe SSDs, and you can add them node by node without any interruption in service. We're an MPP scale-out data warehouse, a shared-nothing architecture. When you add another node to your Yellowbrick system, we handle the data redistribution in the background; there's no interruption in your service when you do that, and so it's very, very easy to incrementally scale this thing out. And as I mentioned earlier, you can imagine the anatomy, the lifetime, of a query: a query comes in from your BI tool like Tableau and hits our planner; we create a plan tree, a query execution plan, out of that, which gets compiled down. And this compiled code gets deployed onto the workers on each of the individual nodes, which process that query against their local part of the data.
That even includes complex joins and aggregations. We can do this in a highly concurrent environment with many concurrent users, because we've also got a very sophisticated workload management capability in place that will properly divide up the resources (memory, storage, and CPU) among the users of the system and prioritize different workloads. So, a long answer to your question, but yes: it's a scale-out capability.
[00:44:21] Unknown:
As far as the actual distribution: is it something where you have a planner node with consistent hashing for determining where the actual data sits on disk? Or, since you mentioned it's shared-nothing, is it more that each of the nodes is a read/write primary, where it doesn't matter which instance you hit, you're able to read and write across the system? Or do you have something like an architecture with a NameNode that's responsible for keeping track of where everything is? Or, like with MongoDB, where the mongos process is responsible for determining which instance has the appropriate shard that you're trying to work
[00:44:56] Unknown:
on? Yeah. It's all purely classic hash-based processing. When you're looking up a particular point of data, you're generating a hash key, and that maps to a particular worker that owns that shard of data. It's a very common way of distributing the data: the worker just operates on its piece of it. So if you look at your query execution plan, at the very leaves of the tree are the table scan nodes that access data on a particular worker, and these all feed up to the root of the tree, which is the point at which you get your result set out. Along the way they go through various join, aggregation, sorting, and partitioning stages within the nodes of that graph, because it's actually a graph rather than a tree. That's essentially how it works. Any node operates on its portion of the data, but, obviously, they're all interconnected at the back end. On our hardware instances, that's InfiniBand, 200-gigabit connectivity between each worker.
And so when it needs to redistribute data as part of the query execution plan, it's very, very efficient to do that. It's a classic kinda shared nothing distributed query problem.
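A toy version of that classic hash placement (not Yellowbrick's implementation): hash the distribution-key value, take it modulo the number of workers, and the resulting worker owns the shard the row lives in.

```python
# Toy illustration of classic hash-based data placement in a shared-nothing
# MPP system; not Yellowbrick's actual implementation.
import zlib

NUM_WORKERS = 8

def worker_for(key: str) -> int:
    """Map a distribution-key value to the worker that owns its shard."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS

rows = [("cust-1001", 42.50), ("cust-2002", 13.99), ("cust-1001", 7.25)]
for key, amount in rows:
    print(f"{key} -> worker {worker_for(key)}")

# Equal key values always map to the same worker, so joins and aggregations
# on the distribution key need no data movement between nodes.
```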
[00:46:00] Unknown:
And so for people who are looking to build out a data warehouse system and are considering Yellowbrick, what are the cases where it's the wrong choice and they might be better suited by a traditional data lake architecture, or a different approach to data warehousing where maybe they wanna do something like ClickHouse or some other solution for managing their data? You know, maybe they just put it all in a Postgres node. I think that pretty much does characterize the boundaries
[00:46:27] Unknown:
here where we're the right choice. If you've got a relatively small amount of data, if it's less than a terabyte, if it's a few hundred gigabytes, then there are choices out on the market that would suit you just as well as Yellowbrick at that kind of small scale. It's when you get into anything from the 1 to 10 to 100 terabyte range up to petabyte scale that Yellowbrick really delivers on price and performance. And it is all about delivering 10 to 100 times the performance at about a fifth of the cost of those large enterprise data warehouses. But what I would say is that a lot of our customers aren't necessarily starting at the petabyte scale. They could be adopting Yellowbrick in a departmental use case that's fairly small, and then what we see is the usage of Yellowbrick expand across the customer base. Almost two-thirds of our customers, for example, have expanded their usage of Yellowbrick substantially since they acquired it. They're satisfied, it does what it says on the tin, and they're getting the value out of the analytics and out of the investments they're making. As you continue to plan for the near to medium term of the technical and product direction of Yellowbrick, what are some of the things that you have planned and that you're most excited to work on? We're getting fully behind this distributed cloud vision with our partners. Distributed cloud isn't something that Yellowbrick will wrap up and take to market as a product; what I see is a blueprint, an architectural pattern that will emerge as these public cloud stacks start to appear in different locations.
And with our adoption of Kubernetes and containerization as a way of deploying, managing, and orchestrating Yellowbrick software anywhere, in the public cloud, on premises, at the network edge, that's where I think things get really exciting for us. I think it's gonna set us apart from other vendors in this space, because we will truly be able to say: we can run anywhere. We can satisfy use cases at the network edge at the smallest level, in a streaming data kind of use case, all the way up to the standard, classic, centralized data warehouse at petabyte scale. We're in a position to provision instances and give you a user experience that is uniform across however geographically disparate a deployment topology you have.
I think that's really exciting to me, and I think it'll be really interesting to see how Yellowbrick gets applied to the new IoT use cases that are gonna emerge here, which I think other data warehouse technologies and databases will struggle to address. So for me, it's all about keeping the foot to the floor in terms of performance and efficiency; we're making leaps and bounds in terms of improving the user experience of data warehousing overall, and we're doing that in a way that maintains that price and performance advantage. Those are the 3 things that are driving our road map forward.
[00:49:17] Unknown:
Are there any other aspects of the Yellowbrick project, or the use cases that it's built for, or your experience of helping to drive the technical direction for it, that we didn't discuss yet that you'd like to cover before we close out the show? There are things we could talk about around the built-in replication capabilities that we have, where you can replicate data
[00:49:36] Unknown:
between geographically remote Yellowbrick instances for DR purposes. I'd also love to talk a little bit more about how we do data resilience and reliability. Many data warehouses, in order to maintain high availability on a single instance, replicate data: you keep 3 copies of the data, like you would in Hadoop, or you mirror the data. But we actually use erasure coding to ensure that if we lose nodes within the system, we don't lose data, and we can do that with a fraction of the overhead, in terms of copies of the data, compared to simply doing brain-dead mirroring. I could probably go on all day, but I'll leave it there. If you wanna know more, I'd just say come to our website. We've got a 40-page white paper that goes into all of the technical intricacies of Yellowbrick; we've just laid it all bare for everyone out there. So if you wanna know more information, check that out. Check out the movies and videos on there from our recent summit; they go into the technical details, you'll see demonstrations there, and you'll even see the use of Denodo for federated queries as well. So, yeah, that would be my advice.
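To see why erasure coding beats mirroring on overhead, consider the simplest possible scheme, a single XOR parity chunk: N data chunks survive one loss at 1/N extra storage, versus 200% extra for three-way copies. Yellowbrick's actual coding is more sophisticated; this sketch only demonstrates the principle.

```python
# Simplest-possible erasure code: one XOR parity chunk protects N equal-length
# data chunks against a single loss, at 1/N storage overhead (vs. 2x extra for
# three-way mirroring). Yellowbrick's real scheme is more sophisticated.
from functools import reduce

def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

data = [b"node-A-chunk", b"node-B-chunk", b"node-C-chunk"]  # 12 bytes each
parity = xor_chunks(data)  # stored on a fourth node

# Simulate losing node B, then rebuild its chunk from the survivors + parity.
rebuilt = xor_chunks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt:", rebuilt.decode())  # -> node-B-chunk
```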
[00:50:49] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes, and we'll add links to all the things you mentioned. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:06] Unknown:
I think one of the biggest gaps is still in the upper layers, around data cataloging. I still haven't seen a solution that's adequate, especially when you start to think about distributed clouds, with your data separated across geographic locations. Keeping track of that data is getting harder and harder, and I think there's a gap there.
[00:51:34] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Yellowbrick. Definitely a very interesting project and one that I'm excited to see where you go with. So thank you for all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Pleasure speaking with you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Mark Cusack and Yellowbrick
Overview of Yellowbrick's Data Warehouse
Challenges and Opportunities in Distributed Cloud
Architectural Decisions and Innovations
Evolving Strategies and Market Realities
Data Modeling and Query Optimization
Technical Intricacies and Customer Use Cases
Capabilities and Underutilized Features
Deployment and Integration
Future Directions and Exciting Developments