Summary
In this episode of the Data Engineering Podcast, Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high-concurrency and low-latency queries, and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns
- Introduction
- How did you get involved in the area of data management?
- Can you describe what StarRocks is and the story behind it?
- There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?
- Can you describe the architecture of StarRocks?
- What are the "-ilities" that are foundational to the design of the system?
- How have the design and focus of the project evolved since it was first created?
- What are the tradeoffs involved in separating the communication layer from the data layers?
- The tiered architecture enables the shared nothing and shared data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?
- The shared data implementation has caching built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?
- StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi, etc.), which overlaps with use cases for Trino/Presto/Dremio/etc. What are the cases where StarRocks acts as a replacement for those systems vs. a supplement to them?
- The other major category of engines that StarRocks overlaps with is OLAP databases (e.g. ClickHouse, Firebolt, etc.). Why might someone use StarRocks in addition to or in place of those technologies?
- We would be remiss if we ignored the dominating trend of AI and the systems that support it. What is the role of StarRocks in the context of an AI application?
- What are the most interesting, innovative, or unexpected ways that you have seen StarRocks used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on StarRocks?
- When is StarRocks the wrong choice?
- What do you have planned for the future of StarRocks?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- StarRocks
- CelerData
- Apache Doris
- SIMD == Single Instruction Multiple Data
- Apache Iceberg
- ClickHouse
- Druid
- Firebolt
- Snowflake
- BigQuery
- Trino
- Databricks
- Dremio
- Data Lakehouse
- Delta Lake
- Apache Hive
- C++
- Cost-Based Optimizer
- Iceberg Summit Tencent Games Presentation
- Apache Paimon
- Lance
- Delta Uniform
- Apache Arrow
- StarRocks Python UDF
- Debezium
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey. And today, I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns. So, Sida, can you start by introducing yourself?
[00:00:59] Sida Shen:
Hey. Hello, everyone. This is Sida. I am a product manager at CelerData and also a StarRocks contributor.
[00:01:08] Tobias Macey:
So hello, everybody. Thank you for tuning in today. And do you remember how you first got started working in data?
[00:01:14] Sida Shen:
Yes. So actually, my whole career started with, like, operational research and applied mathematics. So I was doing a lot of, like, mathematical programming, and that leads naturally to, like, machine learning optimization. Right? So I was basically a data scientist, you know, back then. And I kind of realized that, as we move along, right, you're not really gonna differentiate yourself, like, with algorithms. Right? Data is the foundation of everything. Right? From smaller models to bigger models to LLMs right now. Right? Data is always gonna be your differentiator. How do you manage your data? How do you have good governance on top of your data? Right? That's why I got really interested in database systems.
And that's, you know, how I end up, like, joining StarRocks and started actually working on OLAP database, StarRocks.
[00:02:07] Tobias Macey:
And so digging into StarRocks itself,
[00:02:10] Tobias Macey:
can you give a bit of an overview about what it is that you're building and some of the story behind how it got started and the objectives that you're focused on?
[00:02:18] Sida Shen:
Yes. So StarRocks is a high performance lakehouse query engine, right? It was basically forked from Doris in 2020. It's an Apache-licensed project that was donated to the Linux Foundation in 2023. Right? So basically in 2020, a handful of Doris maintainers, PMC members, we saw basically a challenge in the whole analytics field, right? Basically it was the performance challenge and the excessive pre-computation challenge, right? People just wanted faster queries, right? But they wanted fast queries also with join queries run on the fly, right? Everybody wants to run things on the fly, you know, for the better flexibility, right? For the simplicity of the pipeline, right? So they don't have to build that whole thing, you know, to duplicate their data or those pre-computation, denormalization pipelines.
So from the ground up, we built, you know, a bunch of features, right, to support these use cases. We built a cost-based optimizer, which is really essential, you know, for scalable, fast join performance, right? And we also basically vectorized all of our operators. You know, it's a C plus plus project, right? Doing that is reasonable, right? And with all of the SIMD optimizations, that, you know, gives us a great foundation of, you know, basically fast queries, fast OLAP queries, right? And we also built, you know, something like a primary key table, right? We have a row-level index. You know, that's for, like, sub-second level data upserts and deletes directly on top of columnar storage. Right? So that's the whole thing. And around 2022, we saw the rise of open formats, right, from Apache Iceberg. Basically, open table formats governing Parquet files. You can say it's something like that. Right?
So users have been more willing to take ownership of their own data. Right? And this is further augmented by, you know, the recent rise of AI. Right? We don't know what kind of workloads you wanna run on your own data. And we don't think there is one tool that can take care of all of your data needs. Right? So now is the time for users to basically take ownership of their own data, to store things in an open format so they can, you know, plug in the tools that they want without actually duplicating their data to a proprietary system, which destroys their data governance.
So now we've been working on that, basically to port all of the great performance, the low latency, high concurrency, the customer-facing kind of workloads, and enable them on open formats, on the lakehouse, on your shared storage. So basically, that's the rundown of the story and the things we've built. We can dig into, you know, each one of them as we go.
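To make the primary key table idea concrete, here is a minimal, hypothetical sketch, not taken from the episode: it assumes a StarRocks FE reachable over the MySQL wire protocol on the usual default port 9030 and uses pymysql; the table name is made up and exact DDL options vary by StarRocks version.

```python
# Hypothetical sketch: StarRocks speaks the MySQL wire protocol, so a plain
# MySQL client such as pymysql can create a primary key table and upsert into
# it. Host, port, and DDL options are assumptions; adjust for your deployment.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS demo")
cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.orders (
        order_id BIGINT,
        status   VARCHAR(32),
        amount   DOUBLE
    )
    PRIMARY KEY (order_id)           -- primary key table: row-level upserts and deletes
    DISTRIBUTED BY HASH (order_id)
""")

# On a primary key table, inserting a row whose key already exists acts as an
# upsert, so changes become queryable almost immediately.
cur.execute("INSERT INTO demo.orders VALUES (1001, 'shipped', 25.0)")
cur.execute("INSERT INTO demo.orders VALUES (1001, 'delivered', 25.0)")  # overwrites status
conn.commit()

cur.execute("SELECT status, amount FROM demo.orders WHERE order_id = 1001")
print(cur.fetchone())   # expected: ('delivered', 25.0)
conn.close()
```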
[00:05:18] Tobias Macey:
As far as the overall market and ecosystem for query engines, data compute engines, it has exploded over the past few years. And even before that, there were numerous options on the market. As you mentioned, there is currently a bit of a separation between lakehouse engines in the vein of things like Trino and Dremio and even Databricks. And then there's the data warehouse engines such as Redshift and BigQuery and Snowflake. And then we've got OLAP engines, which are focused on high throughput, high speed, for things like ClickHouse.
And it seems like what you're building with StarRocks is a little bit of a hybrid of all of those. And I'm wondering if you can talk to some of the ways that you think about the differentiating factors from StarRocks versus the broad competitive landscape?
[00:06:12] Sida Shen:
Yeah. Sure. I think let's start with the last thing you mentioned. So basically, like the ClickHouses, like your Druids. Right? Those, like, real-time OLAP databases that users use a lot for their, like, internal real-time dashboards or, you know, customer-facing workloads, right? One of the biggest challenges for those systems is that when they designed the system, they didn't really have on-the-fly joins in mind, you know, when they designed the system, right? And that's a big problem because the database management systems, you know, we know today were developed in, like, the 1970s, right?
Relational databases have been the best practice since then. And somehow, like, just five, ten years ago, we just forgot about it, right? And said, okay, one big flat table is the best practice, and it's really not. It probably solved your performance challenges, but it really introduced a lot more, like, challenges. Right? Firstly, you have to, you know, build denormalized tables. Right? You're basically duplicating your data, basically easily 10x-ing your storage. And, also, you're building excessive denormalization pipelines. Right? Doing that for batch, you know, like, using Spark, that's probably okay. But doing that in real time, right, you need to introduce stream processing. You need to introduce incremental compute, you know, to catch up with the data freshness requirement. Right? That's a whole other proprietary set of skills you have to have. Right? And that's really expensive. You need, like, somebody special that actually codes, you know, to actually do that. Right? Build your, like, stream processing pipeline. Right? And also, denormalization locks your data in a single-view kind of format. Right? Basically, let's think about it: you have a smaller left table. Right? And you have, like, a bigger right table. Right? You have, basically, a dimension table and a fact table. And one day you want to update something, you know, on your dimension table, right? Just doing that one column, not that many rows, is okay. Right? But imagine, like, with denormalization, you have a big flat table underneath. Right? With the number of rows basically equal to your fact table, and you still want to update that one column, guess what? You're stuck at data backfilling for, like, the next, like, three days.
And this is actually a huge problem, and we have a user that operates on petabytes of data. They're building a metrics platform. With a metrics platform, you should be able to add metrics easily, to change metrics easily, right? Because they were on Apache Druid, they couldn't do joins. They had to do denormalization. So each schema change or each edit or, you know, change in their metrics took, like, two to three days, you know, just because of data backfilling. And that's something that you don't want to have just to have, you know, good enough performance. And at StarRocks, we've been working on this problem since, like, you know, it was open sourced. We basically added so many features to the system that we make joins actually run fast enough, so you can ditch your denormalization pipeline. You don't have to rely on your denormalization pipeline. Denormalization should never be a default, even for OLAP systems, even for customer-facing analytics. It should be something that you build on demand, right, to fix those crazy workloads.
Yeah. So that's basically the comparison with the real-time OLAP databases. And the next one I wanna talk about is basically the Trinos, the Prestos, right? The Dremios, those kind of, you know, those query engines that work on top of a shared storage, on a data lake, right? We saw the data lakehouse kind of conversation start, you know, when Databricks introduced Delta, when Netflix introduced Apache Iceberg. Right now, we can do a lot more, you know, basically on top of Parquet files. Before, with Hive, everything needed to touch disk. You know, you want to do schema evolution, right? That needs to touch disk. Everything really relies on HDFS, right? And HDFS is really not that fast, right? So that created a lot of problems. And that's, you know, what gave birth to projects like Apache Iceberg, Apache Hudi, and Delta Lake. Right? Basically, we're adding data warehouse-like features, you know, to Parquet files, to open formats. So why are we doing that? It's because now we can run all sorts of workloads, you know, directly on top of one single source of truth data. Right? So, basically, you can introduce all sorts of tools and different kinds of workloads, like, not having to move data around, not having to duplicate your data into 15 different copies and try to, like, synchronize all of them, and that's just impossible to do.
Right? But one of the challenges we see, you know, in the field is that although, you know, the storage is ready for a lot of the dashboard-like workloads, a lot of the query engines that we have today are not really built for this kind of workload. For example, let's take Trino. Right? Trino is really known, you know, as an engine that really can connect to all kinds of data sources, right, and get okay performance. You can connect to MySQL and you could still do predicate pushdown. Right? You can still get that okay performance, but the engine itself is not 100% optimized for the best possible performance for one data source such as Apache Iceberg. Right? So it's very difficult to scale those systems, you know, to actually take over, you know, the kind of crazy workloads we run with, like, real-time data warehouses, OLAP databases.
Right? And that's why StarRocks exists. Right? StarRocks really bridges that gap. You know, we really try to bring that data warehouse-like performance directly to your lakehouse. Right? So you can run those, like, customer-facing workloads. You could run those, like, dashboards with thousands of people refreshing at the same time. Right? That kind of concurrency, right, we can handle. So we basically complete the puzzle, you know, for the lakehouse kind of paradigm. Right? And this is the second one, I think, for the query engines. Now we have the other, you know, kind of cloud-based, you know, data warehouses, right? You think of Snowflake, you think of Databricks, and you think of maybe even Athena, Redshift, and BigQuery.
And those are great projects and different in their own ways. Right? So what we do is we compensate for, you know, what they don't do. Basically, it's low latency, high concurrency. Right? Those systems are really not designed to scale to that level, you know, to scale to, like, sub-second level latency. Right? So we basically complete that part of the puzzle.
[00:13:00] Tobias Macey:
And before we dig a lot more into different use cases, it's probably worth exploring the architectural design of StarRocks and the ways that that architecture enables the different capabilities that you're focused on. So I'm wondering if you can just give a bit of a high level overview about how the system is implemented and, some of the core, I guess, architectural ilities of scalability, durability, etcetera, that are those core abilities that are foundational to the design and the premise of StarRocks.
[00:13:32] Sida Shen:
Yeah. Of course. So StarRocks itself is incredibly simple. I mean, the thing itself has two kinds of processes. So there are the FE, front-end nodes. Basically, they're responsible for, like, taking the queries, like, understanding the query, generating query plans, and doing metadata management. Right? And also we have the BE nodes, or in our shared data mode they're called compute nodes. Right? Those are basically, like, the workers, you know, the workers of the StarRocks cluster, right? They're responsible for actually executing the query plan, right, that the FE generates, and also storing data, or acting as the IO, you know, to read, to scan data from external shared storage, right? And architecture-wise, I think one of the most important things for StarRocks is, first of all, it's written in C plus plus, right? And we have vectorized all of the operators.
So, with SIMD, that really gives you the good, like, foundational performance for OLAP, right? Because you think about, like, from the bottom, right, columnar storage. That's basically storing things in a columnar format, in columns. You think of vectorization, that's basically, you know, doing things in memory, doing it in big chunks, doing it in columns. And also, you know, with SIMD, right, that's like processing multiple data points, you know, with a single CPU cycle, right? That's batch, right? That's in columns too, right? So basically, all that, the more batch you run, probably the faster you're going to go, right? So this whole architecture is really optimized for those, like, OLAP-style workloads. So basically think of, you know, aggregations and joins, you know, group bys and high-cardinality group bys and things. So this is the basic architecture, and StarRocks also, you know, on the planner side, it has a cost-based optimizer.
Basically, instead of being based on, like, heuristic rules, right, StarRocks actually collects statistics from your own data, fresh statistics, and uses that to compute the cost of each, like, candidate query plan, and uses that as a basis, you know, to basically come up with the best possible query plan, right? And that's really essential. That's probably the only way you can run high-cardinality aggregations and multi-table joins correctly, like, today, right? That's the state of the art right now. It's basically a cost-based optimizer. And that's basically, like, an overview of the planner side. And also on the compute side, we feature a massively parallel processing kind of compute architecture, right? And the heart of MPP, I think, is to, you know, utilize all of the nodes, right, when you do a computation, when you execute a query. I think the heart of it is basically we're able to shuffle data between nodes in memory while the query is running. Right? So the cool thing we can do with that is, basically, think about a two-phase high-cardinality aggregation. Right? We can do multi-phase aggregation because, you know, we can actually, like, shuffle the data between nodes, like, in memory, like, while the query is still running. And we can do shuffle joins. That's basically the only way, you know, to do, like, scalable big-table-to-big-table joins right now. So that shuffling service is essential, you know, in the architecture.
Yeah. So that's basically a rundown, you know, of the architecture. We can dig into, you know, each one, if you're interested. And you talked about, you know, like, the foundation of the system, right? So the scalability or the whatever-ability. If I want to conclude the ilities, is that how you pronounce it? Ilities, right? It would be, I think, first, scalability. StarRocks is really built to scale vertically and scale horizontally without losing much performance, right, to cater to the thousands of QPS of concurrent OLAP queries while also maintaining sub-second performance.
And that's the scalability part. And the second one I wanna say is predictability and reliability. And this is really we have a really big focus on those, like, customer-facing kind of workloads. Right? In customer facing, like, having fast queries, yeah, it's important, but most importantly, your query performance is predictable. Right? One slow query, one very slow query, is a lot worse than all of the queries being, like, medium-level kind of slow. Right? So we built so many kinds of features and functionality optimizations to ensure that your queries are not only fast, but they're also predictable and reliable, right? Even if your infrastructure fails, you know, underneath, some of your EC2 instances are gone, we can still maintain that kind of performance. We can dive into that a little later in a bit. Right? And this, the second one, I think, is predictability and reliability.
And the last one I wanna say is governability. I don't even know if that's a word, but governability, let's go with that. Right? Basically, okay, we talked about a lot of, like, scalability, performance, being fast, being predictable, right? But now we want to not run them in a proprietary format. We want to enable them basically in Parquet files, in the Iceberg table format, in something that other engines are familiar with as well, right? To basically allow you to have a single source of truth of data, right, while also having those great things, you know, about StarRocks, to basically build those integrations on top of open formats, to work with our partners, right, to make that happen, to bring the governability, if that's a word, to our users.
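As a rough illustration of that governability point, the sketch below registers an Iceberg catalog and runs an OLAP-style aggregation against it through the same SQL interface used for StarRocks-managed tables. The catalog property names, metastore URI, and table names are assumptions based on typical StarRocks Iceberg catalog configurations, not details from the conversation.

```python
# Rough sketch (assumed property names and endpoints): register an Iceberg
# catalog once, then query its tables with plain SQL like any other table.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL CATALOG iceberg_lake
    PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "hive",
        "hive.metastore.uris" = "thrift://metastore:9083"
    )
""")

# An OLAP-style aggregation pushed directly against the Iceberg data.
cur.execute("""
    SELECT region, count(*) AS orders, sum(amount) AS revenue
    FROM iceberg_lake.sales.orders
    WHERE order_date >= '2024-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for row in cur.fetchall():
    print(row)
conn.close()
```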
[00:19:29] Tobias Macey:
One of the challenges that always comes up when you're dealing with distributed data systems, where you have databases that can scale the storage across multiple nodes, is the question of how do you deal with partitioning, sharding, replication? How does that factor into the ways that you think about structuring your queries, structuring your tables, etcetera? And I'm wondering, from a schema design perspective, how does that influence the ways that people think about what data to store, how to store the data, how to query the data when they're working with StarRocks, and how much of that does StarRocks take over so that you don't have to think about it as much? I'm thinking in particular about ClickHouse, where you have to think about, at the time of table creation, what type of replication am I doing for this table? Because if I want to change it down the road, then I effectively have to replicate all of the data to a new table, and the same if I wanna change the sharding key or something like that. Yeah.
[00:20:25] Sida Shen:
Yeah. So, for that, I think that's more like a scalability kind of question. Right? So when you scale with ClickHouse, you have to, like, manually shard. Right? For StarRocks, I mean, this is really not a problem for shared data because, like, all of the data is persisted in your S3 buckets, right? So when you scale, you know, you only scale the compute. If you don't think about the cache, you know, it's stateless. It doesn't persist any data. But for shared nothing, you know, the sharding and all of the data distribution are all automatic. So you don't really have to, you know, worry about basically redistributing your data manually, right, when you do want to scale your StarRocks shared nothing cluster. Right? And in terms of, like, table design, I think this is a multilevel question too. So, yeah, we do have partitioning, and partitioning now is automatic.
Right? So that's a lot less that you would have to do, you know, compared to other kinds of systems. Right? And also, I think this is more general with OLAP. Right? I think with OLAP right now, we run everything, you know, in the columnar format, as I mentioned, you know, on the storage side, on the memory side, right? And this is actually more performant than, you know, relying on indexes. So actually, including ClickHouse, OLAP systems today don't really rely on indexes, you know, to have the baseline performance, right? And indexes are basically used for your kind of specialized workloads, right? You want to do, like, a massive count distinct with, like, trillions of rows. And that's when you want to build, you know, some kind of index, a bitmap index, right, to solve that kind of problem, secondary indexes, right?
So creating tables in OLAP, not only with the effort we've done, but the effort that's been done before, even before us, right, with vectorization, right, table creation, or the usage of OLAP, has become, like, so much easier, right, over the past, like, five, ten years. Right? And we're on a mission, you know, to make it even more accessible for all of the users out there.
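A hypothetical example of that table-design story: partitions derived from an expression instead of managed by hand, hash distribution left to the engine, and a bitmap index added only for a specialized filter. The DDL details and names are assumptions and differ across StarRocks versions.

```python
# Hypothetical DDL sketch: expression-based partitioning and engine-managed
# distribution, plus an optional bitmap index for one selective filter.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        event_time DATETIME,
        user_id    BIGINT,
        country    VARCHAR(8),
        payload    STRING
    )
    DUPLICATE KEY (event_time, user_id)
    PARTITION BY date_trunc('day', event_time)   -- partitions created automatically per day
    DISTRIBUTED BY HASH (user_id)                -- sharding handled by the engine
""")

# Secondary index only for a specialized, highly selective predicate;
# baseline performance comes from columnar storage and vectorization.
cur.execute("CREATE INDEX idx_country ON demo.events (country) USING BITMAP")
conn.close()
```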
[00:22:37] Tobias Macey:
Another aspect of the architecture is that, by virtue of the front-end nodes being a separate deployable unit from the compute nodes, it introduces the potential for network latency. And even if you're deploying them collocated on the same hardware, it at least introduces process boundaries. Whereas other database engines may be more fully vertically integrated. Obviously, that helps with the scalability story and the maintainability, deployability, all of those things. And I'm wondering what you see as the trade-offs that are involved in that architectural decision of being able to separate it into tiered layers, both in terms of the negative trade-offs, but also some of the benefits that it enables. And I'm thinking in particular about the shared data layer capabilities of being able to query against the lakehouse formats while also being able to have that shared nothing layer of the data stored on disk.
[00:23:36] Sida Shen:
Yep. So, let's go with the good things first. So first is the languages. Right? We use different languages for different kinds of purposes. Right? And the FE and the BE, you know, they have really big differences in purpose. Right? The FE basically, you know, manages metadata and generates query plans. Right? It's really not that compute heavy. Actually, not compute heavy at all. Right? It's a lot of logic. And the BE is more, a lot of, like, parallel processing, really heavy on the compute. So for the two kinds of purposes, the FE is much more suitable to do in Java, and the BE is much more suitable to do in C plus plus. Right? By separating the two, basically, we can use the language that's more suitable, you know, for that particular kind of use case. Right? And this is basically the biggest benefit that's out there. And talking about trade-offs, yes, there are trade-offs. Network is a concern, but there's really not that much data being exchanged, you know, between your FE and your BE, or CN, nodes. Right? That's just the query plan, the physical plan. Right? And maybe returning the data, but, you know, we do that through the BE too. So it's really not that much data to exchange. Right? I think there are two limitations to this kind of design. So the first is the two being written in different languages. Right? So if, let's say, we have a little function that we implement in the BE, right, we write in C plus plus, right? We cannot really call that function when only the FE is up. Right? So we have to, like, have the whole cluster up.
So, you know, this creates a little bit of trouble, you know, for us on the development side. Right? We have to think about, you know, we do actually have two kinds of languages, and what do we put where. Right? So this is basically, like, the first kind of limitation. And second, this is really not that good if your cluster is very small. So, basically, FE and BE are two kinds of processes. Right? And we cannot really handle, you know, the resource management between the two. Right? We have to, like, completely depend on Linux to do that.
It doesn't really do that that well. Right? So if you want to deploy StarRocks in production, what you will want to do is to have dedicated machines for your FE nodes. Right? But when you only need, like, two, three BE nodes, sometimes one BE node is enough for your deployment, and you want high availability for your FE nodes, right, you want three FE nodes and one BE node, then that's, you know, a waste of resources when your cluster is very small. Right? But, you know, StarRocks is really built for distributed compute, when you have, you know, a really big dataset and really big concurrency.
Right? So that is actually not that big of a problem too. Right? I hope that answer your question.
[00:26:49] Tobias Macey:
Yeah. And then now digging into the use cases that StarRocks enables, particularly around that tiering of the lakehouse shared data and the high-speed OLAP shared nothing, you also have the intermediary bit of being able to cache the lakehouse data. I'm wondering if you can talk to some of the ways that that changes the data platform architecture and the overall usage patterns that are modified as a result of that capability, versus the current landscape of I either have my lakehouse or I have my OLAP engine, but I have to do some extra work to be able to join them together, or maybe I can use an Iceberg table in, like, my ClickHouse, but it's still gonna be really slow, and just some of the ways that having that unification of the stack changes the overall approach the teams take to the design of their data systems?
[00:27:49] Sida Shen:
Yeah. That's a fantastic question. So, yeah, we do have, just a little bit of background information, right, for the listeners that are not familiar with the StarRocks architecture. So we do have our own table format and we do have our own file format. That's basically our internal tables. Basically, when we talk about shared nothing or shared data, we are all talking about, you know, the proprietary, crazy kind of format that we built just for the best possible performance. Right? You have your low-cardinality, you know, kind of, like, storage-level optimizations. Right? And you have all sorts of crazy indexes you can use. Right? And the format itself is more optimized for low latency, high concurrency. Right? And that's basically the shared nothing and shared data kind of thing you're talking about. And also we have Apache Iceberg, right? And Apache Hudi and Delta Lake and Apache Paimon. Right? Those are open table formats.
And they govern ORC and Parquet files, and those are the open file formats. Right? And so how does the whole thing work? So basically, the underlying thing is we want to preserve all of the things that are good about open table formats. Basically, it's single source of truth data, right, in the open format, right? So with this as a requirement, how can we make the performance faster, right? So actually, with StarRocks, you can create materialized views, right, from an Apache Iceberg table. And what it does is that a materialized view is basically a managed StarRocks table in StarRocks' own crazy, own really fast format. Right? So what you can do is create a materialized view. Right? And you can do some kind of data transformation too if it's needed. Right? And that becomes, you know, StarRocks' own kind of format. And also the whole process is managed. So you don't really have to, you know, think about refresh. You don't really have to think about, you know, is the data in sync. You know, if you configure it correctly, it's gonna work, and you don't really have to worry about that. Right? And, also, the materialized view features automatic query rewrite. So, basically, if you do something like a pre-aggregation or a denormalization with a StarRocks materialized view, you don't really have to change your SQL query to benefit from, you know, the materialized view. StarRocks' cost-based optimizer can direct your query, you know, to the most appropriate materialized view, right, automatically, so you don't have to worry about that. Right? So that's basically the way we figured out, you know, how to enable this kind of, like, proprietary-level, like, low latency, high concurrency, still on the single source of truth data in open formats.
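A hedged sketch of that materialized view pattern: an asynchronously refreshed, StarRocks-managed aggregate built over an Iceberg table, with the application continuing to query the Iceberg table and relying on the optimizer's query rewrite. The refresh clause, catalog, and table names are assumptions rather than details given in the episode.

```python
# Hedged sketch: an async materialized view in StarRocks' managed format built
# over an Iceberg table; dashboards keep querying the Iceberg table and the
# cost-based optimizer may rewrite matching queries to the MV.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW demo.daily_revenue_mv
    REFRESH ASYNC EVERY (INTERVAL 1 HOUR)        -- refresh is managed by StarRocks
    AS
    SELECT date_trunc('day', order_date) AS day, region, sum(amount) AS revenue
    FROM iceberg_lake.sales.orders
    GROUP BY date_trunc('day', order_date), region
""")

# The application query still targets the Iceberg table; when the MV is fresh,
# query rewrite can transparently serve it from the StarRocks-managed MV.
cur.execute("""
    SELECT date_trunc('day', order_date) AS day, sum(amount) AS revenue
    FROM iceberg_lake.sales.orders
    WHERE region = 'EMEA'
    GROUP BY date_trunc('day', order_date)
""")
print(cur.fetchall())
conn.close()
```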
[00:30:42] Tobias Macey:
And by virtue of being able to use one interface to query both your shared nothing data sources, where you're guaranteed to have high throughput, high speed, and your lakehouse table formats. Obviously, being able to join across them is gonna introduce slowdown, but it simplifies the logical approach to how to actually work across that data, as well as giving you the ability to, for instance, materialize or build new tables off of the lakehouse data into the shared nothing architecture layer so that you can consistently have high speed through that. So I'm thinking in terms of, like, the medallion architecture style thing pioneered by Databricks, or the typical dbt flow of staging, intermediate, mart, where maybe you stage all of your data as Iceberg files, but then in the intermediate or mart layer, you're materializing those into the StarRocks shared nothing layer. I'm just wondering what you're seeing as some of the ways that people are approaching that style of workflow and some of the benefits that they're seeing as a result.
[00:31:49] Sida Shen:
Yeah. So yeah. That's a great question. And, actually, materialized views are used basically by, like, almost all of our users right now. Right? So with direct query, we can get great performance. You can think about, like, around three to five times, you know, compared to Trino, right, cache off for cache off. But materialized views just take this to, you know, a whole other level. Right? This is basically a painless way for you to add denormalization pipelines or pre-aggregation pipelines on demand. Right? The way I say on demand is because of the query rewrite capability. Right? So you really don't have to worry about whether this query is still gonna work after I build this pipeline. Let's think about it: you build something with Spark, right? Build a denormalized table with Spark.
You have to change your SQL script or reconfigure your entire dashboard, you know, for that thing to take effect, right? And that was really painful for our users, you know, before they moved to StarRocks, because you basically have to figure out all of the pre-processing pipelines before you actually build your application and run your application, because you have to actually, like, go bother your users, you know, to change their SQL scripts. And that's nothing that anyone wants to do. Right? Yeah. So with this, you can actually, you know, just build your application, right, and figure out the performance afterward and add denormalization pipelines afterward. And you end up adding so many fewer pre-computation pipelines. Right? And also, you end up going to production a lot faster, you know, with the query rewrite capability of materialized views.
This is, you know, basically the main way our users are blending our proprietary, optimized format with an open format like Apache Iceberg. Right? And there are actually other use cases too. You know, it doesn't only help with queries. Our proprietary format also has a primary key table, right, which you're probably familiar with. It has a record-level index, right, that's able to accelerate your data freshness to, like, sub ten seconds, you know, directly on top of columnar storage. And that's, you know, that has something to do with memory. So that's not something that Apache Iceberg can do. So we actually have a user that uses the primary key table as the caching layer when they basically ingest the data in. So it first lands in the primary key table. So you get that, you know, ten-second data freshness.
And they built a whole mechanism to basically sync the data from their primary key table to Apache Iceberg, like, every hour. Right? Delete and do a truncate of the table, you know, on top, and make that whole thing, like, an atomic operation. So that's a crazy workload. Right? So they actually presented this whole thing last year at the Iceberg Summit, as Tencent Games. So they were basically able to get sub-ten-second data freshness on top of, like, petabytes of data. Right? And single source of truth still on Apache Iceberg, and running that with materialized views, running that with, like, sub-second query performance. It's just some crazy numbers if you actually combine the two correctly. Right? This is probably, like, one of the most exciting ways you can use StarRocks, you know, I've seen.
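For illustration only, a very simplified version of the landing-then-sync pattern described here: fresh writes land in a StarRocks primary key table, and a scheduled job copies them into an Iceberg table and truncates the landing table. The real Tencent Games pipeline handles atomicity and failure cases that this sketch, with its assumed table names, does not.

```python
# Illustration only: an hourly job that copies data from a hot StarRocks
# primary key landing table into an Iceberg table, then clears the landing
# table. Names are made up; error handling and atomicity are simplified.
import pymysql

def hourly_sync() -> None:
    conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
    cur = conn.cursor()
    try:
        # Move the accumulated fresh rows into the single source of truth on Iceberg.
        cur.execute("""
            INSERT INTO iceberg_lake.analytics.events
            SELECT * FROM demo.events_landing
        """)
        # Clear the landing table so it only holds the newest, freshest data.
        cur.execute("TRUNCATE TABLE demo.events_landing")
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    hourly_sync()   # in practice triggered by cron or an orchestrator every hour
```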
[00:35:11] Tobias Macey:
And that was gonna be my next question. Was the reverse of landing your data in StarRocks for high write throughput and being able to do read-after-write querying in those other applications. So either using it for an application storage layer, maybe not necessarily as the primary database, but for being able to push analytical events, similar to the use cases that ClickHouse is often deployed for, and then being able to periodically flush that data down to Iceberg so that you maintain that as a raw storage table, but you're also still able to use it for being able to power interactive use cases as the data gets rewritten?
[00:35:51] Sida Shen:
That's a really exciting thing. I mean, that's not a problem that I think I think open table formats are looking to solve. Right? And because that's a lot of that is engine specific, you know, a lot of that depends on memory. Right? So, you know, that's one really innovative way you can solve that problem, have that data freshness.
[00:36:08] Tobias Macey:
In terms of the lakehouse support, you mentioned Iceberg, Delta, Hudi. There are a large and growing number of table formats that are available. Obviously, they all have different use cases that they prioritize. And I'm wondering how the work that you're doing in StarRocks focuses on either subsets of that functionality or just the overall problem matrix of supporting all of those tables and being able to support all of the features, versus what are the core elements of those formats that you care most about? And particularly as people ask you to add more and more tables to the list.
[00:36:46] Sida Shen:
Yeah. That's a great question. That's a great question. So, for all of the table formats, I think almost all of them, you know, even just Iceberg, Delta, Paimon, and Hudi, and even Lance, right, they were all, like, born for similar reasons. They're all solving this very similar problem, which is the Hive table just doesn't work anymore with the scale we're operating on, with the frequency that we want to update our data. Right? The frequency we want to add new columns, to delete columns, you know, to do schema evolution. Right? It's not keeping up. And I think they all solve very, very similar problems, but they started at different places. And I do see a lot of the table formats converging, you know, and having a lot more shared functionality than we saw, like, two, three years ago. And yesterday at the Iceberg Summit, a lot of the features that were introduced, like deletion vectors, like, let's say, geospatial kind of features, they're shared between Delta and Iceberg.
So I imagine actually the opposite. So going forward, we should be able to see a lot fewer proprietary features, you know, in those table formats. And also what helps more is that we have innovative projects right now, such as Delta Uniform, in Delta 2.0, and Apache XTable. That, basically, you know, is one copy of Parquet files, and it generates three copies of the metadata. Metadata is a lot lighter, you know, so they can have, you know, three copies of metadata and one copy of the data files. So in that way, you know, we should be able to work with more of a unified interface for our query engines. That's tomorrow. That's something that we really want to come, like, actually tomorrow.
[00:38:38] Tobias Macey:
And then for cases where teams already have some investment in one of these other query engines, whether they're using Trino, Dremio, etcetera, for their lakehouse use case, or they're using Snowflake, etcetera, for their warehousing, or they're using ClickHouse or Druid for OLAP, or maybe they're using some combination of all of those. What are the situations where you typically see teams adopting StarRocks as a supplementary tool versus a replacement of any or all of those?
[00:39:10] Sida Shen:
A supplementary tool. I think we see a lot more of that, you know, with query engines. Right? Because the heart of, you know, like, the lakehouse is basically one giant copy of data and 15,000 different tools on top of it using that one copy of data without moving the data. And that really sparks, you know, innovation on our side because that really drives specialized tools. And StarRocks is that type of specialized tool on the lakehouse. Right? And we're probably the only one out there that is really hyper focused on high concurrency, low latency, reliability, you know, those kinds of, like, customer-facing workloads. So we do see a whole ton of StarRocks running side by side with Apache Spark.
So Spark does a lot of the business transformation, does a lot of, like, the compaction. Right? And we handle a lot of, like, the interactive queries, and we handle a lot of, like, the high concurrency, like, the dashboards and the reports. And also, you know, like, customer-facing applications where they're directly serving all of those queries to their customers and running everything on the fly. So that is probably the biggest. And, also, we do see a lot of ClickHouse users, but that's more often a direct replacement. And they, you know, they just look at their, like, 10,000 denormalized tables and get frustrated, and look at their bill, and they, you know, they cry. Right? And they wanna find something that can run joins on the fly, and, you know, they find us. And later, they discover, you know, even more reasons, you know, to stay on StarRocks, because of the scalability and because of the data upserts, the real-time kind of, you know, real-time data upserts. You can actually, you know, have your data change. What a surprise. Right?
Like, in real time. So they end up, like, slowly, slowly, like, replacing the whole system. But denormalization is basically most of the reason that a lot of the ClickHouse users choose StarRocks in the first place.
[00:41:11] Tobias Macey:
Digging more into the ClickHouse comparison, one of the things that gives them a built-in advantage at the moment is the overall ecosystem investment around it, where they've been around for a while, they have gained a good reputation. There are a lot of systems that presume ClickHouse as their storage layer. I'm thinking even in particular of some very new systems, such as AI gateways, that rely on ClickHouse as the storage layer for storing some of their user interaction data, telemetry data, etcetera. And I'm wondering, in terms of that piece of the puzzle for StarRocks, how you're approaching the ease of integration, ease of adoption from a protocol and ecosystem layer.
[00:41:50] Sida Shen:
Protocol and ecosystem, we definitely are spending a lot more of our effort on those. But I think, in our experience, that's really not a huge blocker for a lot of the users, because a lot of the things they're experiencing when they're trying to go to production, when they're trying to scale, those problems are much more difficult to solve than a lot of the, you know, higher-level kind of issues. And StarRocks is not really exactly, yeah, we do lack, when compared to ClickHouse, you know, some functions. But we are working with the community, you know, to make StarRocks a much more mature product.
[00:42:32] Tobias Macey:
And digging a bit further into my brief mention just now of the AI ecosystem, obviously, that's another area that has been driving a lot of development in terms of completely new data engines as well as feature additions to existing ones. And I know that StarRocks has an array data type which can be used for storing vectors. And I'm just curious what the current stance is of StarRocks in that broad ecosystem of AI applications and being able to enable them and deal with vector embeddings, etcetera.
[00:43:04] Sida Shen:
Yes. So I think it's good for me to give, like, a general picture and then, like, some feature explanations. Right? So first, forgive me for going back to lakehouse again, but this is really lakehouse. So, lakehouse, they did that a long time ago, right, since 2017, 2018. It didn't really become, like, that popular, that big, until just a few years ago. Right? The reason for that is I think AI is really a big push, you know, for the lakehouse adoption, or the lakehouse kind of, like, conversation. Because, like, with a data warehouse, with, you know, just think about, like, data warehouse kind of workloads, five, ten years ago, it was probably possible that one system could take care of, you know, all of your needs. Right? But with AI just right around the corner, nobody knows what's gonna come tomorrow. Right? What kind of new tools am I gonna have to use the next day?
So users taking ownership of their own data, storing their data in Parquet files instead of, let's say, the StarRocks file format, right, has become a lot more important in the past five years just because of, you know, the boom of AI, machine learning kind of workloads. And that's, I think, a lot of the reason why we started to build this whole, like, lakehouse integration. Right? Because we do want our users to have their own data, to own their own data, and to have the freedom, you know, to choose what kind of tools they can use tomorrow. So that's more, like, the lakehouse kind of conversation. And what makes the lakehouse even more important today, I think, is the birth of agents, analytical agents. Agents work a lot faster. They don't really have to sit there and think. You know, they can go through, like, 10,000 tokens in, like, a few seconds. Right? Having a universal view of your entire dataset inside of your organization is something that's very important for AI agents. So you want to have that very strong single source of truth kind of data layer inside of your organization to better leverage, you know, AI agents. Right? All sorts of agents, not only, let's say, analytical types of agents. And that, I think, you know, gives it an even bigger push. That's the whole, like, lakehouse thing, right? I think, just talking about agents, you know, there are actually a lot of users in our community already building analytical agents with StarRocks. I can't really say who yet because they're still figuring it out, in the early phase. Right? But we do see a lot of very exciting new projects coming. I think StarRocks really pairs well with agents because with agents, you have to have agent-scale kind of concurrency. Nobody knows, like, how big that's gonna be. And low latency is gonna become, like, an even bigger kind of requirement because they work just so much faster than I do. And also you have automation in your pipeline. You have less human in the loop. You need your query engine to be able to self-optimize, right, to not have that many knobs you have to turn. Right? You have to be able to learn from the queries that were previously executed to constantly evolve and to make sure that your query plans are always, you know, the most optimal. And we do have that feature built in, by the way. And I think the characteristics of StarRocks are really good for solving a lot of the agent problems that we see in the industry today. Right? And so that's basically, you know, the bigger thing, right? The more forward-looking thing. And feature-wise, you know, yes, Tencent Games did contribute the vector index functionality.
Right? So you can use StarRocks to basically do RAG, right, or to do similarity search like Tencent Games. Tencent, I think they do, like, a reverse image search. And you can do, like, a hybrid search, right, because we are so good at SQL queries. Right? You can do a lot more, you know, with your vector search with StarRocks. And that feature is available if you want to try that out today. And also, to better incorporate into the whole AI ecosystem, we really strengthened our Python integration. Right now we have Arrow support, now we have a Python UDF, and we're gonna have a lot more, you know, going into the future. Does that answer your question?
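To show roughly what that hybrid search capability could look like, here is a speculative sketch: embeddings stored in an ARRAY&lt;FLOAT&gt; column alongside regular columns, queried with a similarity function combined with an ordinary SQL filter. The function name approx_cosine_similarity and the schema are assumptions made to illustrate the idea; check the StarRocks documentation for the actual vector search syntax.

```python
# Speculative sketch of hybrid search: embeddings in an ARRAY<FLOAT> column
# combined with a normal SQL predicate. The approx_cosine_similarity function
# name is an assumption about the vector search feature, not confirmed syntax.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.docs (
        doc_id    BIGINT,
        category  VARCHAR(32),
        embedding ARRAY<FLOAT>
    )
    PRIMARY KEY (doc_id)
    DISTRIBUTED BY HASH (doc_id)
""")

query_vec = "[0.12, 0.80, 0.33, 0.05]"  # embedding of the user's query, produced by any model

# Rank by vector similarity while filtering with plain SQL in the same statement.
cur.execute(f"""
    SELECT doc_id,
           approx_cosine_similarity(embedding, {query_vec}) AS score
    FROM demo.docs
    WHERE category = 'support_ticket'
    ORDER BY score DESC
    LIMIT 10
""")
print(cur.fetchall())
conn.close()
```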
[00:47:26] Tobias Macey:
Absolutely. Yeah. Definitely a lot of exciting stuff to dig into. And in your experience of working on this project, working in the community and with your customers at CelerData, what are some of the most interesting or innovative or unexpected ways that you've seen StarRocks applied?
[00:47:43] Sida Shen:
Okay. So, let's think, you know, the first one has to be Tencent Games, right, with their crazy kind of, you know, sub-ten-second data freshness directly on top of Apache Iceberg, with petabytes of data, sub-second latency. Right? That is a very innovative way, you know, of utilizing our format with open formats, as a lakehouse innovator. And something also that is, I think, a lot more interesting is that a StarRocks-based, or StarRocks-powered, lakehouse, StarRocks and Apache Iceberg, are actually solving those problems nobody believed were possible. Like customer-facing analytics, right? That's something that requires very low latency, high concurrency, and stability of your latency, reliability of your latency. Right? We have users that use StarRocks and Apache Iceberg in production, serving their customers, their paying customers. Right? So the first example can be, let's think about Herdwatch.
Herdwatch is actually a customer. They build, like, customer-facing dashboards for cows, for, like, livestock. They have, like, tens of millions of livestock they have to monitor for their customers, which are, you know, farms. And they used StarRocks and Apache Iceberg to basically replace Athena, right, and brought down query latency from, like, tens of seconds to, like, sub-second. TRM Labs is another one I've mentioned too. Right? And we have a lot more use cases, such as customer facing, such as Pinterest. Pinterest uses us to, you know, do that kind of, you think of, like, YouTube analytics page. Right? I'm sure you're familiar with that. So those are for, like, content creators and for advertisers. You'll see in real time how their ads are running, right, to make real-time decisions on those. Right? With that, and with Coinbase, customer facing, just a bunch more. We have a bunch more case studies on the starrocks.io website. You can check it out. And there's one thing I wanna mention, unexpected.
I did just talk to a user last week. We have that primary key table that can do real-time data upserts, but it's not a TP database. That guy is not doing OLAP. That guy's using StarRocks as a transactional database, a transactional database only, to serve his operational workload. It's been working fine for, like, a year and a half. I just heard about that, like, last week. Yeah. That's probably the most, like, unexpected way, you know, people use it. You probably shouldn't, but, you know, it's good to see. It's always fun to see.
[00:50:24] Tobias Macey:
And in your experience of working on this technology, working with this community, and in this problem domain, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:50:35] Sida Shen:
I think challenge-wise, we've been through so many difficult things. I mean, building our cost-based optimizer was a lot more difficult than we thought. It's basically an optimization problem. Basically, you know, we did need to work with the community a lot in the first, like, two years to make it, basically, it took, you know, a year and some change to confidently say that we can turn it on by default. And thank God we're open source, or else a lot of the features that we built today would not be possible without our initial adopters trying it out and showing us the crazy, stupid query plans that our CBO had generated.
Right? Without those, you know, that wouldn't be possible. And I think the biggest challenge that we see is basically a scale challenge. We do test our system really rigorously. Right? We think terabytes of data, that's big, 10 terabytes of data, that's big. But once we get put into production systems, 10 terabytes is not the data size, it's the scanned data size after pruning, right? And that's very difficult, right? With hundreds of StarRocks instances, each with, like, fifty-plus cores. And a lot of the problems do show up. You know, you see a lot more bottlenecks, you find new bottlenecks in your system, you know, when you push a system to that kind of scale, right? For example, when there's a data skew, right? Shuffling can be, you know, can be a huge challenge. And that's a problem we spent, like, almost half a year to solve, right? We didn't even think that was there, you know, in the first place. And that's also, you know, thanks to us being open source and our open source users. You know, we've been iterating for the past, like, five years to make this system, you know, as stable and as performant and as reliable as it is today.
[00:52:23] Tobias Macey:
For people who are interested in simplifying their data stack or they really care about speed and latency, what are the cases where StarRocks is the wrong choice?
[00:52:34] Sida Shen:
Yeah. We have to talk about that guide again. Yeah. Don't use it as a transactional database. There are better options out there. If you want to, you know, analyze your transactional database, read the CDC and use something like a, Debezium, right, to CDC it into StarRocks, and we support updates. So, you know, that works. Don't use it as a transactional database. That's probably not your best bet. And also for very large ETL kind of workloads that runs for, like, hours or days, you're better off probably using something like Spark, something that's designed for those kind of workloads, especially, you know, if you're on a data lake. You can just run the tools, you know, simultaneously on top of one one copy of data.
So yeah. So that's basically it.
[00:53:18] Tobias Macey:
And as you continue to build and iterate on the StarRocks engine and the ecosystem around it, what are some of the things you have planned for the near to medium term or any particular projects or problem areas or or capabilities that you're excited to explore?
[00:53:33] Sida Shen:
Yeah. So this year, we're really looking to to further stabilize, you know, the query performance, not only just for your for our own, like, shared data or shared nothing kind of architecture. Right? But more for like Apache Iceberg. Right? So how to deal with, unmaintained or less maintained iceberg metadata files, right, how to better read them parallelly, right, how to better cache them, right? How to make that metadata cache more reliable, right? And also how to make our own data cache more reliable. We've already built a lot of features just to solve that specific problem. And we're going to work with the community and to make those features more mature and more robust, right? So you can end up running the craziest kind of workloads on your open data, on your own data. And this is exactly what we want in the future, not only for your data kind of workload, but only also for your AI. So we take apart, you know, into that big evolution, making everything open. And that is really what we're looking for. Further power customer facing analytics on that open type of format.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey. And today, I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns. So, Sida, can you start by introducing yourself?
[00:00:59] Sida Shen:
Hey. Hello, everyone. This is Sida. I am a product manager at CelerData and also a StarRocks contributor.
[00:01:08] Tobias Macey:
So hello, everybody. Thank you for tuning in today. And do you remember how you first got started working in data?
[00:01:14] Sida Shen:
Yes. So actually, my whole career started with operational research and applied mathematics. So I was doing a lot of mathematical programming, and that leads naturally to machine learning and optimization. Right? So I was basically a data scientist, you know, back then. And I kind of realized that, as we move along, you're not really going to differentiate yourself with algorithms. Right? Data is the foundation of everything. Right? From smaller models to bigger models to LLMs right now. Right? Data is always gonna be your differentiator. How do you manage your data? How do you have good governance on top of your data? Right? That's why I got really interested in database systems.
And that's, you know, how I ended up joining StarRocks and started actually working on an OLAP database, StarRocks.
[00:02:07] Tobias Macey:
And so digging into StarRocks itself,
can you give a bit of an overview about what it is that you're building and some of the story behind how it got started and the objectives that you're focused on?
[00:02:18] Sida Shen:
Yes. So StarRocks is a high performance lakehouse query engine, right, that was forked from Doris in 2020. It's an Apache licensed project that was donated to the Linux Foundation in 2023. Right? So basically, in 2020, a handful of Doris maintainers and PMC members, we saw a challenge in the whole analytics field, right? Basically, it was the performance challenge and the excessive pre-computation challenge, right? People just wanted faster queries, right? But they wanted fast queries with join queries run on the fly, right? Everybody wants to run things on the fly, you know, for the better flexibility, right, for the simplicity of the pipeline, right, so they don't have to build that whole thing to duplicate their data, those pre-computation, denormalization pipelines.
So from the ground up, we built a bunch of features, right, to support these use cases. We built a cost based optimizer, which is really essential, you know, for scalable, fast join performance, right? And we also basically vectorized all of our operators. You know, it's a C++ project, right? Doing that is reasonable, right? And with all of the SIMD optimization, that gives us a great foundation of, you know, basically fast OLAP queries, right? And we also built something like a primary key table, right? We have a row level index. You know, that's for, like, sub second level data upserts and deletes directly on top of columnar storage. Right? So that's the whole thing. And around 2022, we saw the rise of open formats, right, from Apache Iceberg. Basically, open table formats governing Parquet files. You can say it's something like that. Right?
So users have become more willing to take ownership of their own data. Right? And this is augmented by, you know, the recent rise of AI. Right? We don't know what kind of workloads you'll wanna run on your own data. And we don't think there is one tool that can take care of all of your data needs. Right? So now is the time for users to take ownership of their own data, to store things in an open format so they can, you know, plug in the tools that they want without actually duplicating their data to a proprietary system, which destroys their data governance.
So now we've been working on that, basically to port all of that great performance, the low latency, high concurrency of those customer facing workloads, and enable them on open formats, on the lakehouse, on your shared storage. So basically, that's the rundown of the story and the things we've built. We can dig into, you know, each one of them as we go.
[00:05:18] Tobias Macey:
As far as the overall market and ecosystem for query engines and data compute engines, it has exploded over the past few years. And even before that, there were numerous options on the market. As you mentioned, there is currently a bit of a separation between lakehouse engines in the vein of things like Trino and Dremio and even Databricks. And then there's the data warehouse engines such as Redshift and BigQuery and Snowflake. And then we've got OLAP engines, which are focused on high throughput, high speed, things like ClickHouse.
And it seems like what you're building with StarRocks is a little bit of a hybrid of all of those. And I'm wondering if you can talk to some of the ways that you think about the differentiating factors of StarRocks versus the broad competitive landscape?
[00:06:12] Sida Shen:
Yeah. Sure. I think let's start with the last thing you mentioned. So basically, like ClickHouse, like your Druid, right? Those real time OLAP databases that users use a lot for their internal real time dashboards or, you know, customer facing workloads, right? One of the biggest challenges for those systems is that when they designed the system, they didn't really have on-the-fly joins in mind, right? And that's a big problem, because the database management systems we know today were developed back in, like, the 1970s, right?
Relational databases have been the best practice since then. And somehow, like, just five, ten years ago, we just forgot about it, right? And said, okay, one big flat table is the best practice, and it's really not. It probably solves your performance challenges, but it really introduces a lot more challenges. Right? Firstly, you have to build denormalized tables. Right? You're basically duplicating your data, easily 10x-ing your storage. And, also, you're building excessive denormalization pipelines. Right? Doing that for batch, you know, with, like, Spark, that's probably okay. But doing that in real time, right, you need to introduce stream processing, you need to introduce incremental compute, you know, to catch up with the data freshness requirement. Right? That's a whole other specialized skill set you have to have. Right? And that's really expensive. You need somebody special that actually codes, you know, to build your stream processing pipeline. Right? And also, denormalization locks your data into a single view kind of format. Right? Basically, let's think about it: you have a smaller left table, right, and you have a bigger right table. Right? You have, basically, a dimension table and a fact table. And one day you want to update something on your dimension table, right? If you just do that one column, not that many rows, that's okay. Right? But imagine, with denormalization, you have a big flat table underneath, right, with the number of rows basically equal to your fact table, and you still want to update that one column. Guess what? You're stuck at data backfilling for, like, the next three days.
And this is actually a huge problem, and we have a user that operates on petabytes of data. They're building a metrics platform. On a metrics platform, you should be able to add metrics easily, to change metrics easily, right? Because they were on Apache Druid, they couldn't do joins. They had to do denormalization. So each schema change or each edit or change in their metrics took, like, two to three days, you know, just because of data backfilling. And that's something that you don't want to have just to get, you know, good enough performance. And at StarRocks, we've been working on this problem since, like, you know, it was open sourced. We basically added so many features to the system that we make joins actually run fast enough, so you can ditch your denormalization pipeline. You don't have to rely on your denormalization pipeline. Denormalization should never be a default, even for an OLAP system, even for customer facing analytics. It should be something that you build on demand, right, to fix those crazy workloads.
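To make the trade-off concrete, here is a minimal sketch of keeping the data normalized and doing the join on the fly. The table names and columns are hypothetical, it assumes a StarRocks frontend reachable over its MySQL-compatible protocol (so a stock MySQL driver like pymysql works), and it assumes the dimension table uses the primary key model so a single-column update stays cheap.

```python
import pymysql

# Hypothetical setup: fact_events (big, append-only) and dim_user (small, primary key model).
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
cur = conn.cursor()

# On-the-fly join instead of a pre-built flat table: no denormalization pipeline needed.
cur.execute("""
    SELECT u.segment, count(*) AS events, sum(e.amount) AS revenue
    FROM fact_events e
    JOIN dim_user u ON e.user_id = u.user_id
    WHERE e.event_date >= '2024-01-01'
    GROUP BY u.segment
""")
print(cur.fetchall())

# Changing a dimension attribute touches only the small table,
# not a petabyte-scale flat table that would need days of backfill.
cur.execute("UPDATE dim_user SET segment = 'enterprise' WHERE user_id = 42")
conn.commit()
```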
Yeah. So that's basically the comparison with the real time OLAP databases. And the next one I wanna talk about is basically the Trinos, the Prestos, right, the Dremios, those query engines that work on top of shared storage, on a data lake, right? We saw the data lakehouse kind of conversation start, you know, when Databricks introduced Delta, when Netflix introduced Apache Iceberg. Right now, we can do a lot more, you know, basically on top of Parquet files. Before, with Hive, everything needed to touch disk. You know, you want to do schema evolution, right? That needs to touch disk. Everything really relies on HDFS, right? And HDFS is really not that fast, right? So that created a lot of problems. And that's, you know, what gave birth to projects like Apache Iceberg, Apache Hudi, and Delta Lake, right? Basically, we're adding data warehouse like features, you know, to Parquet files, to open formats. So why are we doing that? It's because now we can run all sorts of workloads, you know, directly on top of one single source of truth data. Right? So, basically, you can introduce all sorts of tools and different kinds of workloads without having to move data around, without having to duplicate your data into 15 different copies and try to, like, synchronize all of them, and that's just impossible to do.
Right? But one of the challenges we see, you know, in the field is that although the storage is ready for a lot of those dashboard like workloads, a lot of the query engines that we have today are not really built for those kinds of workloads. Take Trino, for example. Right? Trino is really known as an engine that can connect to all kinds of data sources, right, and get okay performance. You can connect to MySQL and still do predicate pushdown. Right? You can still get that okay performance, but the engine itself is not 100% optimized for the best possible performance for one data source such as Apache Iceberg. Right? So it's very difficult to scale those systems, you know, to actually take over the kind of crazy workloads we run with, like, real time data warehouses, OLAP databases.
Right? And that's why StarRocks exists. Right? StarRocks really bridges that gap. You know, we really try to bring that data warehouse like performance directly to your lakehouse. Right? So you can run those customer facing workloads. You can run those dashboards with thousands of people refreshing at the same time. Right? That kind of concurrency, right, we can handle. So we basically complete the puzzle, you know, for the lakehouse kind of paradigm. Right? And this is the second one, I think, for the query engines. Now we have the other kind of cloud based data warehouses, right? You think of Snowflake, you think of Databricks, and you think of maybe even Athena, Redshift, and BigQuery.
And those are great projects and different in their own ways. Right? So what we do is we compensate for, you know, what they don't do. Basically, it's low latency, high concurrency. Right? Those systems are really not designed to scale to that level, you know, to scale to, like, sub second level latency. Right? So we basically complete that part of the tool.
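As a rough illustration of what "warehouse performance directly on the lakehouse" looks like in practice, here is a sketch of pointing StarRocks at an existing Iceberg catalog and querying it in place. The catalog name, REST endpoint, and database and table names are placeholders, and the exact catalog properties differ by catalog type (Hive metastore, Glue, REST) and StarRocks version, so treat the property keys as assumptions to check against the documentation.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***")
cur = conn.cursor()

# Register the existing Iceberg catalog once; no data is copied or ingested.
# Property names are illustrative; storage credentials (e.g. S3 keys) are omitted here.
cur.execute("""
    CREATE EXTERNAL CATALOG iceberg_lake
    PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "rest",
        "iceberg.catalog.uri" = "http://iceberg-rest.example.com:8181"
    )
""")

# Query the open-format tables directly, joins included.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM iceberg_lake.sales.orders o
    JOIN iceberg_lake.sales.customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
for row in cur.fetchall():
    print(row)
```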
[00:13:00] Tobias Macey:
And before we dig a lot more into different use cases, it's probably worth exploring the architectural design of StarRocks and the ways that that architecture enables the different capabilities that you're focused on. So I'm wondering if you can just give a bit of a high level overview about how the system is implemented and, some of the core, I guess, architectural ilities of scalability, durability, etcetera, that are those core abilities that are foundational to the design and the premise of StarRocks.
[00:13:32] Sida Shen:
Yeah. Of course. So StarRocks itself is incredibly simple. I mean, the thing itself has two kinds of processes. So there are the FE, the front end nodes. Basically, they're responsible for, like, taking the queries, understanding the query, generating query plans, and doing metadata management. Right? And also we have the BE nodes, or in our shared data mode they're called compute nodes. Right? Those are basically the workers, you know, the workers of the StarRocks cluster, right? They're responsible for actually executing the query plans the FE generates, and also storing data, or acting as the IO layer, you know, to read, to scan data from external shared storage, right? And architecture wise, I think one of the most important things for StarRocks is, first of all, it's written in C++, right? And we have vectorized all of the operators.
So, with SIMD, that really gives you the foundational good performance for OLAP, right? Because you think about it, like, from the bottom, right? Columnar storage, that's basically storing things in a columnar format, in columns. You think of vectorization, that's basically, you know, doing things in memory, doing it in big chunks, doing it in columns. And also, you know, with SIMD, right, that's processing multiple data points with a single CPU cycle, right? That's batch, right? That's in columns too, right? So basically, with all that, the more batched you run, probably the faster you're going to go, right? So this whole architecture is really optimized for those OLAP style workloads. So basically think of, you know, aggregations and joins, group bys and high cardinality group bys and things. So this is the basic architecture, and StarRocks also, you know, on the planner side, it has a cost based optimizer.
Basically, instead of relying on heuristic rules, right, StarRocks actually collects statistics from your own data, fresh statistics, and uses that to compute the cost of each candidate query plan, and uses that as the basis, you know, to come up with the best possible query plan, right? And that's really essential. That's probably the only way you can run high cardinality aggregations and multi table joins correctly today, right? That's the state of the art right now, basically a cost based optimizer. So that's basically an overview of the planner side. And also on the compute side, we feature a massively parallel processing kind of compute architecture, right? And the heart of MPP, I think, is to utilize, you know, all of the nodes, right, when you do a computation, when you execute a query. I think the heart of it is basically that we're able to shuffle data between memory while the query is running. Right? So the cool thing we can do with that is, think about a two phase high cardinality aggregation. Right? We can do multi phase aggregation because, you know, we can actually shuffle the data between nodes, between memory, while the query is still running. And we can do shuffle joins. That's basically the only way, you know, to do scalable big table to big table joins right now. So that shuffling service is essential, you know, in the architecture.
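This is not StarRocks code, just a toy Python sketch of the shuffle idea described here: each node pre-aggregates its local batch (phase one), partial results are routed between nodes by hashing the group key, and each owning node merges what it received into final totals (phase two).

```python
# Toy sketch of MPP-style two-phase aggregation with a shuffle, not StarRocks code.
from collections import defaultdict

NODES = 3
local_batches = [  # (group_key, value) rows as each node might scan them
    [("us", 1), ("de", 2), ("us", 3)],
    [("de", 4), ("fr", 5), ("us", 6)],
    [("fr", 7), ("fr", 8), ("de", 9)],
]

# Phase 1: local partial aggregation on each node.
partials = []
for rows in local_batches:
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    partials.append(acc)

# Shuffle: send each partial (key, subtotal) to the node that owns hash(key).
inboxes = [defaultdict(int) for _ in range(NODES)]
for acc in partials:
    for key, subtotal in acc.items():
        owner = hash(key) % NODES
        inboxes[owner][key] += subtotal

# Phase 2: each owner finalizes the keys it received.
final = {}
for inbox in inboxes:
    final.update(inbox)

print(final)  # e.g. {'us': 10, 'de': 15, 'fr': 20}
```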
Yeah. So that's basically it, a rundown, you know, of the architecture. You can dig into, you know, each one if you're interested. And you talked about, you know, the foundations of the system, right? The scalability or the whatever-ability. If I want to conclude the -ilities, is that how you pronounce it? -Ilities, right? It would be, I think, first scalability. StarRocks is really built to scale vertically and scale horizontally without losing much performance, right, to cater to thousands of QPS of concurrent OLAP queries while also maintaining sub second performance.
And that's the scalability part. And the second one I wanna say is predictability and reliability. We have a really big focus on those customer facing kinds of workloads. Right? In customer facing, having fast queries, yeah, it's important, but most important is that your query performance is predictable. Right? One very slow query is a lot worse than all of the queries being, like, medium level kind of slow. Right? So we built so many features and optimizations to ensure that your queries are not only fast, but they're also predictable and reliable, right? And even if your infrastructure fails underneath, some of your EC2 instances are gone, we can still maintain that kind of performance. We can dive into that a bit later. Right? And this is the second one, I think: predictability and reliability.
And the last one I wanna say is governability. I don't even know if that's a word, but governability, let's go with that. Right? Basically, okay, we talked about a lot of scalability, performance, being fast, being predictable, right? But now we want to not run them in a proprietary format. We want to enable them basically on Parquet files, on the Iceberg table format, on something that other engines are familiar with as well, right? To basically allow you to have a single source of truth of data, right, while also having those great things about StarRocks, to build those integrations on top of open formats, to work with our partners, right, to make that happen, to bring the governability, if that's a word, to the users.
[00:19:29] Tobias Macey:
One of the challenges that always comes up when you're dealing with distributed data systems, where you have databases that can scale the storage across multiple nodes, is the question of how do you deal with partitioning, sharding, replication? How does that factor into the ways that you think about structuring your queries, structuring your tables, etcetera? And I'm wondering, from a schema design perspective, how does that influence the ways that people think about what data to store, how to store the data, how to query the data when they're working with StarRocks, and how much of that does StarRocks take over so that you don't have to think about it as much. I'm thinking in particular about ClickHouse, where you have to think, at the time of table creation, about what type of replication am I doing for this table? Because if I want to change it down the road, then I effectively have to replicate all of the data to a new table, and likewise if I wanna change the sharding key or something like that. Yeah.
[00:20:25] Sida Shen:
Yeah. So, for that, I think that's more of a scalability kind of question. Right? So when you scale with ClickHouse, you have to, like, manually shard. Right? For StarRocks, I mean, this is really not a problem for shared data, because, like, all of the data is persisted in your S3 buckets, right? So when you scale, you know, you only scale the compute. If you don't think about the cache, you know, it's stateless. It doesn't persist any data. But for shared nothing, you know, the sharding and all of the data distribution are all automatic. So you don't really have to worry about redistributing your data manually, right, when you do want to scale your StarRocks shared nothing cluster. Right? And in terms of table design, I think this is a multilevel question too. So, yeah, we do have partitioning, and partitioning now is automatic.
Right? So that's a lot less that you would have to do, you know, compared to other kinds of systems. Right? And also, I think this is more general with OLAP. Right? In OLAP right now, we run everything, you know, in the columnar format, as I mentioned, on the storage side, on the memory side, right? And this is actually more performant than, you know, relying on indexes. So all OLAP systems today, actually including ClickHouse, don't really rely on indexes, you know, for the baseline performance, right? And indexes are basically used for your specialized kinds of workloads, right? You want to do, like, a massive count distinct with, like, trillions of rows, and that's when you want to build, you know, some kind of index, a bitmap index, right, to solve that kind of problem, secondary indexes, right? So, with not only the effort we've done but the effort that's been done before, even before us, right, and with the advent of vectorization, creating tables and using OLAP has become, like, so much easier, right, over the past, like, five, ten years. Right? And we're on a mission, you know, to make it even more accessible for all of the users out there.
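For a sense of what table design looks like in practice, here is a sketch of a fairly typical StarRocks table plus the kind of specialized index mentioned above. The schema is made up, and details such as automatic or expression-based partitioning and whether DISTRIBUTED BY is required at all vary by StarRocks version, so treat it as illustrative rather than canonical.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
cur = conn.cursor()

# A plain duplicate-key table: no manual sharding, partitions created as data arrives.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_events (
        event_time DATETIME NOT NULL,
        user_id    BIGINT,
        country    VARCHAR(64),
        latency_ms DOUBLE
    )
    DUPLICATE KEY (event_time, user_id)
    PARTITION BY date_trunc('day', event_time)
    DISTRIBUTED BY HASH(user_id)
""")

# Baseline scans and aggregations run off columnar storage without any index.
# A bitmap index is something you add on demand for specialized workloads,
# e.g. selective filters on a lower-cardinality column, not as a default.
cur.execute("CREATE INDEX idx_country ON page_events (country) USING BITMAP")
```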
[00:22:37] Tobias Macey:
Another aspect of the architecture is that, by virtue of the front end nodes being a separate deployable unit from the compute nodes, it introduces the potential for network latency. And even if you're deploying them collocated on the same hardware, it at least introduces process boundaries, whereas other database engines may be more fully vertically integrated. Obviously, that helps with the scalability story and the maintainability, deployability, all of those things. And I'm wondering what you see as the trade offs that are involved in that architectural decision of being able to separate it into tiered layers, both in terms of the negative trade offs, but also some of the benefits that it enables. And I'm thinking in particular about the shared data layer capabilities of being able to query against the lakehouse formats while also being able to have that shared nothing layer of the data stored on disk?
[00:23:36] Sida Shen:
Yep. So, let's go with the good things first. So first is the languages. Right? We use different languages for different purposes. Right? And FE and BE, you know, they have a really big difference in purposes. Right? The FE basically, you know, manages metadata and generates query plans. Right? It's really not that compute heavy. Actually, not compute heavy at all. Right? It's a lot of logic. And the BE is a lot more, like, parallel processing, really heavy on the compute. So for the two kinds of purposes, the FE is much more suitable to do in Java,
and the BE is much more suitable to do in C++. Right? By separating the two, we can use the language that's more suitable, you know, for that particular kind of use case. Right? And this is basically the biggest benefit that's out there. And talking about trade offs, yes, there are trade offs. Network is a concern, but there's really not that much data being exchanged, you know, between your FE and your BE or CN nodes. Right? That's just the query plan, the physical plan. Right? And maybe returning the data, but, you know, we do that through the BE too. So it's really not that much data to exchange. Right? I think there are two limitations to this kind of design. So the first is the two being written in different languages. Right? So let's say we have a little function that we implement in the BE, right, that we write in C++. Right? We cannot really call that function when only the FE is up. Right? So we have to, like, have the whole cluster up.
So, you know, this creates a little bit of trouble for us on the development side. Right? We have to think about, you know, we actually do have two kinds of languages, and where do we put what. Right? So this is basically, like, the first kind of limitation. And second, this is really not that good if your cluster is very small. So, basically, FE and BE are two kinds of processes. Right? And we cannot really handle the resource management between the two. Right? We have to, like, completely depend on Linux to do that.
It doesn't really do that that well. Right? So if you want to deploy StarRocks in production, what you will want to do is to have dedicated machines for your FE nodes. Right? But when you only need, like, two or three BE nodes, sometimes one BE node is enough for your deployment, and you want high availability for your FE nodes, right, you want three FE nodes and one BE node, then that's, you know, a waste of resources when your cluster is very small. Right? But, you know, StarRocks is really built for distributed compute, when you have, you know, a really big dataset and really big concurrency.
Right? So that is actually not that big of a problem either. Right? I hope that answers your question.
[00:26:49] Tobias Macey:
Yeah. And then now digging into the use cases that StarRocks enables, particularly around that tiering of the lakehouse shared data and the high speed OLAP shared nothing, You also have the intermediary bit of being able to cache the lake house data. I'm wondering if you can talk to some of the ways that that changes the data platform architecture and the overall usage patterns that are modified as a result of that capability versus the current landscape of I either have my lake house or I have my OLAP engine, but I have to do some extra work to be able to join them together, or maybe I can use an iceberg table for, you know, in, like, my ClickHouse, but it's gonna be it's still gonna be really slow and just some of the ways that having that unification of the stack changes the overall approach the teams take to the design of their data systems?
[00:27:49] Sida Shen:
Yeah. That's a fantastic question. So, yeah, just a little bit of background information, right, for the listeners that are not familiar with the StarRocks architecture. So we do have our own table format, and we do have our own file format. That's basically our internal tables. When we talk about shared nothing or shared data, we are talking about, you know, the proprietary, crazy kind of format that we built just for the best possible performance. Right? You have your low cardinality, you know, storage level optimizations. Right? And you have all sorts of crazy indexes you can use. Right? And the format itself is more optimized for low latency, high concurrency. Right? And that's basically the shared nothing and shared data kind of thing we're talking about. And also we have Apache Iceberg, right? And Apache Hudi and Delta Lake and Apache Paimon. Right? Those are open table formats.
And they govern ORC and Parquet files, and those are the open file formats. Right? And so how does the whole thing work? So basically, the underlying thing is we want to preserve all of the things that are good about open table formats, which is basically single source of truth data, right, in the open format, right? So with this as a requirement, how can we make the performance faster, right? So actually, you can create materialized views with StarRocks, right, from an Apache Iceberg table. And what that does is, a materialized view is basically a managed StarRocks table in StarRocks' own really fast format. Right? So what you can do is create a materialized view, right, and you can do some kind of data transformation too if it's needed. Right? And that becomes, you know, StarRocks' own kind of format. And also the whole process is managed, so you don't really have to think about refresh, you don't really have to think about, you know, is the data in sync. You know, if you configure it correctly, it's gonna work, and you don't really have to worry about that. Right? And, also, the materialized view features automatic query rewrite. So, basically, if you do something like a pre-aggregation or a denormalization with a StarRocks materialized view, you don't really have to change your SQL query to benefit from the materialized view. StarRocks' cost based optimizer can direct your query, you know, to the most appropriate materialized view, right, automatically, so you don't have to worry about that. Right? So that's basically the way we figured out, you know, how to enable this kind of proprietary level, like, low latency, high concurrency, still on the single source of truth data in open formats.
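Here is a minimal sketch of that pattern under the same assumptions as before (an `iceberg_lake` external catalog, made-up table names, MySQL-protocol access); the exact materialized view clauses, refresh options, and defaults depend on the StarRocks version, so this is illustrative rather than canonical.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
cur = conn.cursor()

# An asynchronously refreshed materialized view over an Iceberg table:
# the rollup lives in StarRocks' own format, the source of truth stays in Iceberg.
cur.execute("""
    CREATE MATERIALIZED VIEW daily_revenue
    REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
    AS
    SELECT date_trunc('day', o.order_ts) AS order_day,
           c.region,
           sum(o.amount) AS revenue
    FROM iceberg_lake.sales.orders o
    JOIN iceberg_lake.sales.customers c ON o.customer_id = c.id
    GROUP BY 1, 2
""")

# Dashboards keep querying the Iceberg tables directly; when the optimizer
# recognizes a matching query shape, it can rewrite it to hit the materialized view.
cur.execute("""
    SELECT date_trunc('day', o.order_ts), sum(o.amount)
    FROM iceberg_lake.sales.orders o
    GROUP BY 1
""")
print(cur.fetchall())
```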
[00:30:42] Tobias Macey:
And by virtue of being able to use one interface to query both your shared nothing data sources, where you're guaranteed to have high throughput, high speed, and your lakehouse table formats. Obviously, being able to join across them is going to introduce slowdown, but it simplifies the logical approach to how to actually work across that data, as well as giving you the ability to, for instance, materialize or build new tables off of the lakehouse data into the shared nothing architecture layer so that you can consistently have high speed through that. So I'm thinking in terms of, like, the medallion architecture style thing pioneered by Databricks, or the typical dbt flow of staging, intermediate, mart, where maybe you stage all of your data as Iceberg files, but then in the intermediate or mart layer, you're materializing those into the StarRocks shared nothing layer. I'm just wondering what you're seeing as some of the ways that people are approaching that style of workflow and some of the benefits that they're seeing as a result.
[00:31:49] Sida Shen:
Yeah. So yeah. That's a great question. And, actually, materialized views are used by, like, almost all of our users right now. Right? So querying directly, we can get great performance. You can think about, like, around three to five times, you know, compared to Trino, right, cache off for cache off. But materialized views just take this to, you know, a whole other level. Right? This is basically a painless way for you to add denormalization pipelines or pre-aggregation pipelines on demand. Right? The reason I say on demand is because of the query rewrite capability. Right? So you really don't have to worry about whether a query is still gonna work after I build this pipeline. Let's think about it: you build something with Spark, right? Build a denormalized table with Spark.
You have to change your SQL script or reconfigure your entire dashboard, you know, for that thing to take effect, right? And that was really painful for our users, you know, before they moved to StarRocks, because you basically have to figure out all of the pre-processing pipelines before you actually build and run your application, because otherwise you have to, like, go bother your users, you know, to change their SQL scripts. And that's nothing that anyone wants to do. Right? Yeah. So with this, you can actually, you know, just build your application, right, and figure out the performance afterward, and add denormalization pipelines afterward. And you end up adding so many fewer pre-computation pipelines. Right? And also, you end up going to production a lot faster, you know, with the query rewrite capability of materialized views.
This is, you know, basically the main way our users are blending our proprietary, optimized format with an open format like Apache Iceberg. Right? And there are actually other use cases too. You know, it doesn't only help with queries. Our proprietary format also has a primary key table, right, which you're probably familiar with. It has a record level index, right, that's able to accelerate your data freshness to, like, sub ten seconds, you know, directly on top of columnar storage. And that, you know, has to do with memory. So that's not something that Apache Iceberg can do. So we actually have a user that uses the primary key table as the caching layer when they ingest the data in. So it first lands in the primary key table, so you get that, you know, ten second data freshness.
And they built a whole mechanism to basically sync the data from their primary key table to Apache Iceberg, like, every hour. Right? Delete and do a truncate of the table on top, and make that whole thing, like, an atomic operation. So that's a crazy workload. Right? They actually presented this whole thing last year at the Iceberg Summit, as Tencent Games. So they were basically able to get sub ten second data freshness on top of, like, petabytes of data. Right? And the single source of truth is still on Apache Iceberg, and they're running that with materialized views, running that at, like, sub second query performance. It's just some crazy numbers if you actually combine the two correctly. Right? This is probably, like, one of the most exciting ways you can use StarRocks, you know, that I've seen.
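This is not Tencent Games' actual mechanism, just a sketch of the general shape of that pattern: fresh data lands in a StarRocks primary key table for sub-ten-second freshness, and a periodic job moves the accumulated rows into an Iceberg table that stays the single source of truth. Table names are hypothetical, it assumes a StarRocks version that can write to Iceberg tables through an external catalog, and the real implementation would need to make the copy-then-clear step atomic and idempotent, which this sketch does not.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="loader", password="***", database="rt")
cur = conn.cursor()

# Hourly job: move everything older than the current hour from the hot
# primary key landing table into the Iceberg archive, then clear it out.
cur.execute("""
    INSERT INTO iceberg_lake.archive.events
    SELECT * FROM rt.events_landing
    WHERE event_time < date_trunc('hour', now())
""")

# WARNING: in a real pipeline this delete must be coordinated with the insert
# above (and with late-arriving rows) so the hand-off is effectively atomic.
cur.execute("""
    DELETE FROM rt.events_landing
    WHERE event_time < date_trunc('hour', now())
""")
conn.commit()
```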
[00:35:11] Tobias Macey:
And that was gonna be my next question, which is the reverse: landing your data in StarRocks for high write throughput and being able to do read-after-write querying in those other applications. So either using it as an application storage layer, maybe not necessarily as the primary database, but for being able to push analytical events, similar to the use cases that ClickHouse is often deployed for, and then being able to periodically flush that data down to Iceberg so that you maintain that as a raw storage table, but you're also still able to use it for powering interactive use cases as the data gets rewritten?
[00:35:51] Sida Shen:
That's a really exciting thing. I mean, that's not a problem that I think I think open table formats are looking to solve. Right? And because that's a lot of that is engine specific, you know, a lot of that depends on memory. Right? So, you know, that's one really innovative way you can solve that problem, have that data freshness.
[00:36:08] Tobias Macey:
In terms of the lakehouse support, you mentioned Iceberg, Delta, Hudi. There are a large and growing number of table formats that are available. Obviously, they all have different use cases that they prioritize. And I'm wondering how the work that you're doing in StarRocks focuses on either subsets of that functionality or just the overall problem matrix of supporting all of those tables and being able to support all of the features, versus what are the core elements of those formats that you care most about? And particularly as people ask you to add more and more tables to the list.
[00:36:46] Sida Shen:
Yeah. That's a great question. That's a great question. So, for all of the table formats, I think almost all of them, even just Iceberg, Delta, Paimon, and Hudi, and even Lance, right? They were all born for similar reasons. They're all solving this very similar problem, which is that the Hive table just doesn't work anymore with the scale we're operating on, with the frequency that we want to update our data, right, the frequency we want to add new columns, to delete columns, you know, to do schema evolution. Right? It's not keeping up. And I think they all solve very, very similar problems, but they started in different places. And I do see a lot of the table formats converging, you know, and having shared functionality a lot more than we saw, like, two, three years ago. And yesterday at the Iceberg Summit, a lot of the features that were introduced, like deletion vectors, like, let's say, geospatial kinds of features, they're shared between Delta and Iceberg.
So I imagine actually the opposite. So going forward, we should be able to see a lot fewer proprietary features, you know, in those table formats. And what also helps is that we have innovative projects right now, such as Delta UniForm in Delta 2.0 and Apache XTable, that basically, you know, take one copy of Parquet files and generate three copies of the metadata. Metadata is a lot, a lot lighter, you know, so you can have three copies of metadata and one copy of the data files. So in that way, you know, we should be able to work with more of a unified interface for our query engines. That's tomorrow. That's something we really want to come, like, actually tomorrow.
[00:38:38] Tobias Macey:
And then for cases where teams already have some investment in one of these other query engines, whether they're using Trino, Dremio, etcetera, for their lakehouse use case or they're using Snowflake, etcetera, for their warehousing or they're using ClickHouse or Druid for OLAP or maybe they're using some combination of all of those. What are the situations where you typically see teams addressing StarRocks as a supplementary tool versus a replacement of any or all of those?
[00:39:10] Sida Shen:
As a supplementary tool, I think we see a lot more of that, you know, with the query engines. Right? Because the heart of, you know, the lakehouse is basically one giant copy of data and 15,000 different tools on top of it using that one copy of data without moving the data. And that really sparks, you know, innovation on our side because that really drives specialized tools. And StarRocks is that type of specialized tool on the lakehouse. Right? And we're probably the only one out there that is really hyper focused on high concurrency, low latency, reliability, you know, those kinds of customer facing workloads. So we do see a whole ton of StarRocks running side by side with Apache Spark.
So Spark does a lot of the business transformation, does a lot of, like, the compaction. Right? And we handle a lot of the interactive queries, and we handle a lot of, like, the high concurrency, like, the dashboards and the reports, and also, you know, customer facing applications where they're directly serving all of those queries to their customers and running everything on the fly. So that is probably the biggest one. And, also, we do see a lot of ClickHouse users, but that's more often a direct replacement. They, you know, they just look at their, like, 10,000 denormalized tables and get frustrated, and look at their bill and, you know, they cry. Right? And they wanna find something that can run joins on the fly, and, you know, they find us. And later, they discover, you know, even more reasons to stay on StarRocks, because of the scalability and because of the real time data upserts. You can actually, you know, have your data change. What a surprise. Right?
Like, in real time. So they end up, like, slowly, slowly replacing the whole system. But denormalization is basically most of the reason that a lot of the ClickHouse users choose StarRocks in the first place.
[00:41:11] Tobias Macey:
Digging more into the ClickHouse comparison, one of the things that gives them a built in advantage at the moment is the overall ecosystem investment around it, where they've been around for a while, they have gained a good reputation. There are a lot of systems that presume ClickHouse as their storage layer. I'm thinking even in particular of some very new systems, such as AI gateways, that rely on ClickHouse as the storage layer for storing some of the user interaction data, telemetry data, etcetera. And I'm wondering, in terms of that piece of the puzzle for StarRocks, how you're approaching the ease of integration, ease of adoption from a protocol and ecosystem layer.
[00:41:50] Sida Shen:
Protocol and ecosystem, we're definitely spending a lot more of our effort on those. But I think, in our experience, that's really not a huge blocker for a lot of the users, because the things they're experiencing when they're trying to go to production, when they're trying to scale, those problems are much more difficult to solve than a lot of the higher level kinds of issues. And StarRocks is not really, exactly, yeah, we do lack some functions when compared to ClickHouse. But we are working with the community, you know, to make StarRocks a much more mature product.
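One concrete piece of that integration story, for listeners who want to kick the tires: StarRocks exposes a MySQL-compatible wire protocol on the FE, so most MySQL drivers, ORMs, and BI tools can connect without a dedicated connector. A minimal sketch follows; the host, port, and credentials are placeholders, and 9030 is the usual FE query port, so confirm it for your deployment.

```python
import pymysql

# Any MySQL-protocol client works; the same idea applies to JDBC drivers or
# SQLAlchemy URLs such as "mysql+pymysql://analyst:***@starrocks-fe.example.com:9030/demo".
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
with conn.cursor() as cur:
    cur.execute("SHOW DATABASES")
    print(cur.fetchall())
```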
[00:42:32] Tobias Macey:
And digging a bit further into my brief mention just now of the AI ecosystem, obviously, that's another area that has been driving a lot of development in terms of completely new data engines as well as feature additions to existing ones. And I know that StarRocks has an array data type which can be used for storing vectors. And I'm just curious what the current stance is of StarRocks in that broad ecosystem of AI applications and being able to enable them and deal with vector embeddings, etcetera.
[00:43:04] Sida Shen:
Yes. So I think it's good for me to give, like, a general picture and then, like, some feature explanations. Right? So first, forgive me for going back to lakehouse again, but this is really lakehouse. So, lakehouse, they did that a long time ago, right, since 2017, 2018. It didn't really become, like, that popular, that big until just a few years ago. Right? The reason for that is, I think, AI is really a big push, you know, for the lakehouse adoption, or the lakehouse kind of conversation. Because, like, with data warehouses, with, you know, just data warehouse kinds of workloads, five, ten years ago, it was probably possible that one system could take care of, you know, all of your needs. Right? But with AI just right around the corner, nobody knows what's gonna come tomorrow. Right? What kind of new tools am I gonna have to use the next day?
So users taking ownership of their own data, storing their data in Parquet files instead of, let's say, the StarRocks file format, right, has become a lot more important in the past five years just because of, you know, the boom of AI and machine learning kinds of workloads. And that's, I think, a lot of the reason why we started to build this whole lakehouse integration. Right? Because we do want our users to own their own data and to have the freedom, you know, to choose what kind of tools they can use tomorrow. So that's more, like, the lakehouse kind of conversation. And what makes the lakehouse even more important today, I think, is the birth of agents, analytical agents. Agents work a lot faster. They don't really have to sit there and think. You know, they can go through, like, 10,000 tokens in, like, a few seconds. Right? Having a universal view of your entire dataset inside of your organization is something that's very important for AI agents. So you want to have that very strong single source of truth kind of data layer inside of your organization to better leverage, you know, AI agents. Right? All sorts of agents, not only, let's say, analytical types of agents. And that, I think, you know, gives it an even bigger push. That's the whole, like, lakehouse thing, right? I think, just talking about agents, a lot of users in our community are actually already building analytical agents with StarRocks. I can't really say who yet because they're still figuring it out, in the early phase. Right? But we do see a lot of very exciting new projects to come. I think StarRocks really pairs well with agents because, with agents, you have to have agent scale kind of concurrency. Nobody knows, like, how big that's gonna be. And low latency, which is gonna become, like, an even bigger kind of requirement, because they work just so much faster than I do. And also, you have automation in your pipeline. You have less human in the loop. You need your query engine to be able to self optimize, right, to not have that many knobs you have to turn, right, to be able to learn from the queries that were previously executed, to constantly evolve and to make sure that your query plans are always, you know, the most optimal. And we do have that feature built, by the way. And I think the characteristics of StarRocks are really good for solving a lot of the agent problems that we see in the industry today. Right? And so that's basically, you know, the bigger thing, right, the more forward looking thing. And feature wise, you know, yes, Tencent Games did contribute the vector index functionality.
Right? So you can use StarRocks to basically do RAG, right, or to do similarity search like Tencent Games. Tencent, I think they do, like, a reverse image search. And you can do, like, a hybrid search, right, because we are so good at SQL queries. Right? You can do a lot more, you know, with your vector search with StarRocks. And that feature is available if you want to try that out today. And also, to better incorporate into the whole AI ecosystem, we really strengthened our Python integration. Right now we have Arrow support, now we have Python UDFs, and we're gonna have a lot more, you know, going into the future. Does that answer your question?
[00:47:26] Tobias Macey:
Absolutely. Yeah. Definitely a lot of exciting stuff to dig into. And in your experience of working on this project, working in the community and with your customers at CelerData, what are some of the most interesting or innovative or unexpected ways that you've seen StarRocks applied?
[00:47:43] Sida Shen:
Okay. So, let's see, you know, the first one has to be Tencent Games, right, with their crazy, you know, sub ten second data freshness directly on top of Apache Iceberg, with petabytes of data and sub second latency. Right? That is a very innovative way, you know, of utilizing our format together with open formats, as a lakehouse innovator. And something that is, I think, a lot more interesting is that the StarRocks powered lakehouse, StarRocks and Apache Iceberg, are actually solving problems nobody believed were possible. Like customer facing analytics, right? That's something that requires very low latency, high concurrency, and stability of your latency, reliability of your latency. Right? We have users that use StarRocks and Apache Iceberg in production, serving their customers, their paying customers. Right? So the first example can be, let's think about Herdwatch.
Herdwatch is actually a customer. They build customer facing dashboards for cows, for, like, livestock. They have, like, tens of millions of livestock they have to monitor for their customers, which are, you know, farms. And they use StarRocks and Apache Iceberg to basically replace Athena, right, and brought query latency down from, like, tens of seconds to, like, sub second. TRM Labs is another one I've mentioned too. Right? And we have a lot more customer facing use cases, such as Pinterest. Pinterest uses us to, you know, do that kind of, you think of, like, a YouTube analytics page. Right? I'm sure you're familiar with that. So that's for, like, content creators and for advertisers, so you'll see in real time how their ads are running, right, to make real time decisions on those. Right? With that, and with Coinbase, also customer facing, just a bunch more. We have a bunch more case studies on the starrocks.io website. You can check it out. And there's one thing I wanna mention, something unexpected.
I did just talk to a user last week. We have that primary key table that can do real time data upserts, but it's not a TP database. That guy is not doing OLAP. That guy's using StarRocks as a transactional database, a transactional database only, to serve his operational workload. It's been working fine for, like, a year and a half. I just heard about that, like, last week. Yeah. That's probably the most, like, unexpected way, you know, people use it. You probably shouldn't, but, you know, it's good to see. It's always fun to see.
[00:50:24] Tobias Macey:
And in your experience of working on this technology, working with this community, and in this problem domain, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:50:35] Sida Shen:
I think challenge-wise, we've been through so many difficult things. I mean, building our cost-based optimizer was a lot more difficult than we thought. It's basically an optimization problem, and we did need to work with the community a lot in the first, like, two years or so before we could confidently say that we can turn it on by default. And thank God we're open source, or else a lot of the features that we have today would not be possible without our initial adopters trying it out and showing us the crazy, stupid query plans that our CBO had generated.
Right? Without those, you know, that wouldn't be possible. And I think the biggest challenge that we see is basically a scale challenge. We do test our system really rigorously. Right? We think terabytes of data, that's big; ten terabytes of data, that's big. But once we get put into production systems, ten terabytes is not the data size, it's the scanned data size after pruning, right? And that's very difficult, right, with hundreds of StarRocks instances, each with, like, fifty-plus cores. And a lot of the problems do show up. You find new bottlenecks in your system when you push it to that kind of scale, right? For example, when there's data skew, shuffling can be a huge challenge. And that's a problem we spent, like, almost half a year to solve. Right? We didn't even think it was there, you know, in the first place. And that's also thanks to us being open source and to our open source users. You know, we've been iterating for the past, like, five years to make this system as stable and as performant and as reliable as it is today.
[00:52:23] Tobias Macey:
For people who are interested in simplifying their data stack or who really care about speed and latency, what are the cases where StarRocks is the wrong choice?
[00:52:34] Sida Shen:
Yeah. We have to talk about that guy again. Yeah. Don't use it as a transactional database. There are better options out there. If you want to, you know, analyze your transactional database, read the CDC and use something like Debezium, right, to CDC it into StarRocks, and we support updates. So, you know, that works. Don't use it as a transactional database. That's probably not your best bet. And also, for very large ETL kinds of workloads that run for, like, hours or days, you're better off probably using something like Spark, something that's designed for those kinds of workloads, especially, you know, if you're on a data lake. You can just run the two tools, you know, simultaneously on top of one copy of data.
So yeah. So that's basically it.
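As a rough illustration of the CDC-plus-updates pattern Sida mentions, a StarRocks primary key table lets changes captured by a tool like Debezium land as upserts. The table and column names below are made up for the example, so treat it as a sketch rather than a recipe.

    -- Sketch: a primary key table that accepts upserts from a CDC stream (names are illustrative).
    CREATE TABLE orders_mirror (
        order_id   BIGINT NOT NULL,
        status     VARCHAR(32),
        amount     DECIMAL(12, 2),
        updated_at DATETIME
    )
    PRIMARY KEY (order_id)
    DISTRIBUTED BY HASH(order_id);

    -- A row replayed from the CDC feed with an existing key overwrites the previous version,
    -- so the table tracks the latest state of each order without a separate merge job.
    INSERT INTO orders_mirror VALUES (1001, 'shipped', 49.99, '2024-06-01 10:15:00');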
[00:53:18] Tobias Macey:
And as you continue to build and iterate on the StarRocks engine and the ecosystem around it, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas or capabilities that you're excited to explore?
[00:53:33] Sida Shen:
Yeah. So this year, we're really looking to further stabilize, you know, the query performance, not only for our own, like, shared data or shared nothing kind of architecture, right, but more for, like, Apache Iceberg. Right? So how to deal with unmaintained or less maintained Iceberg metadata files, right, how to better read them in parallel, how to better cache them, how to make that metadata cache more reliable, right? And also how to make our own data cache more reliable. We've already built a lot of features just to solve that specific problem, and we're going to work with the community to make those features more mature and more robust, right? So you can end up running the craziest kind of workloads on your open data, on your own data. And this is exactly what we want in the future, not only for your data kind of workloads, but also for your AI. So we take part, you know, in that big evolution of making everything open. And that is really what we're looking for: further powering customer-facing analytics on those open table formats.
[00:54:43] Tobias Macey:
Are there any other aspects of StarRocks, this overall space of high throughput, high speed queries across data lake and shared nothing data layers, and the capabilities that it enables that we didn't discuss yet that you'd like to cover before we close out the show?
[00:55:05] Sida Shen:
Yeah. There are a lot more. I mean, I think one aspect is caching, if you're interested, we can talk about it. Right? Caching is something that we built actually a pretty long time ago for our shared data architecture. Right? Actually, our first iteration of our caching already looked really good on benchmarks. That's why I don't really like benchmarks that much, because they don't show the full picture. Right? If you run everything hot on the benchmark, which everybody does, it doesn't show, like, actually how well your cache can hold hot data. Right? But in customer-facing analytics, that's, like, the most important aspect of a cache: how to prevent your entire cache from getting flushed by a big scan, and how to make sure that if some kind of service triggers a query, the cache is there, so you're not reading off S3 at, like, a hundred milliseconds per scan. That's not acceptable. Right? So we built a whole bunch of things, you know, just for the cache itself, like a segmented cache. You know, we have a mechanism to protect hot data from being evicted by a big scan or a big query. And we have compute replicas. So basically, we have multiple replicas of the same cache stored on different nodes, so that if one of your EC2 instances fails, you still have your performance SLA. Because, like, for customer facing, we actually see a lot of performance SLAs that are more strict than the availability of EC2 instances. And, you know, it's just a lot of things. And I think for the cache, you know, there's another thing, like, for example, cache warm-up. We have a warm-up mechanism for the cache. That's really not that conventional.
You don't really see that in those, like, interactive query engines, because, like, stability of latency is really not that big of a deal in those systems. But for customer facing, we cannot afford, you know, to have any kind of slow queries. And that's just caching. And we built so much more stuff, you know, to help you stabilize your queries. You can, you know, go to our release notes, and we have a bunch of features that just got released, you know, in the three dot two release, oh, the three dot four release, that you can check out. That's basically all I have. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the StarRocks team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
There's so many gaps. I think the biggest gap right now is that the requirements from customers have changed so fast. You know, from "just a cloud data warehouse is good enough for me" to "I want to run everything on Parquet files." You know, this is a big shift. But a lot of the tools that we're using today were developed, like, five, ten years ago, when this requirement was not a thing. Just think about, like, in our field, query engines. We weren't, like, thinking about building the fastest query engine for a Hive table. Right? Back then it was, you know, like, connect to all of your data sources and get okay performance, or ingest into a proprietary format to get the best kind of performance. Now we want both, and the tools are not here yet. And we try to close that gap, but there are more gaps in this field, you know, just in this transition to open formats. So I think this is definitely the biggest gap we see right now.
[00:58:28] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing on StarRocks and the capabilities and use cases that it supports. Definitely a very interesting project, very enticing technology that I intend to spend more time investigating. So I appreciate the time and energy that you and the rest of the team put into that, and I hope you enjoy the rest of your day. Thank you. Thank you. Appreciate that.
[00:59:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to the Podcast and Guests
Sida Shen's Journey into Data Engineering
Overview of StarRocks and Its Origins
StarRocks in the Competitive Landscape
Architectural Design of StarRocks
Data Partitioning and Sharding in StarRocks
Use Cases and Benefits of StarRocks
StarRocks as a Supplementary or Replacement Tool
Challenges and Lessons Learned in Developing StarRocks
Future Plans and Developments for StarRocks