Summary
In this episode of the Data Engineering Podcast, Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high-concurrency and low-latency queries, and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns
- Introduction
- How did you get involved in the area of data management?
- Can you describe what StarRocks is and the story behind it?
- There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?
- Can you describe the architecture of StarRocks?
- What are the "-ilities" that are foundational to the design of the system?
- How have the design and focus of the project evolved since it was first created?
- What are the tradeoffs involved in separating the communication layer from the data layers?
- The tiered architecture enables the shared nothing and shared data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?
- The shared data implementation has caching built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?
- StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi, etc.), which overlaps with use cases for Trino/Presto/Dremio/etc. What are the cases where StarRocks acts as a replacement for those systems vs. a supplement to them?
- The other major category of engines that StarRocks overlaps with is OLAP databases (e.g. ClickHouse, Firebolt, etc.). Why might someone use StarRocks in addition to or in place of those technologies?
- We would be remiss if we ignored the dominating trend of AI and the systems that support it. What is the role of StarRocks in the context of an AI application?
- What are the most interesting, innovative, or unexpected ways that you have seen StarRocks used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on StarRocks?
- When is StarRocks the wrong choice?
- What do you have planned for the future of StarRocks?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- StarRocks
- CelerData
- Apache Doris
- SIMD == Single Instruction Multiple Data
- Apache Iceberg
- ClickHouse
- Druid
- Firebolt
- Snowflake
- BigQuery
- Trino
- Databricks
- Dremio
- Data Lakehouse
- Delta Lake
- Apache Hive
- C++
- Cost-Based Optimizer
- Iceberg Summit Tencent Games Presentation
- Apache Paimon
- Lance
- Delta Uniform
- Apache Arrow
- StarRocks Python UDF
- Debezium
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey. And today, I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns. So, Sida, can you start by introducing yourself?
[00:00:59] Sida Shen:
Hey. Hello, everyone. This is Sida. I am a product manager at CelerData and also a StarRocks contributor.
[00:01:08] Tobias Macey:
So hello, everybody. Thank you for tuning in today. And do you remember how you first got started working in data?
[00:01:14] Sida Shen:
Yes. So actually, my whole career started with, like, operational research and applied mathematics. So I was doing a lot of, like, mathematical programming, and that leads naturally to, like, machine learning optimization. Right? So I was basically a data scientist, you know, back then. And I kind of realized that, as we move along, right, you're not really gonna differentiate yourself, like, with algorithms. Right? Data is the foundation of everything. Right? From smaller models to bigger models to LLMs right now. Right? Data is always gonna be your differentiator. How do you manage your data? How do you have good governance on top of your data? Right? That's why I got really interested in database systems.
And that's, you know, how I end up, like, joining StarRocks and started actually working on OLAP database, StarRocks.
[00:02:07] Tobias Macey:
And so digging into StarRocks itself,
[00:02:10] Tobias Macey:
can you give a bit of an overview about what it is that you're building and some of the story behind how it got started and the objectives that you're focused on?
[00:02:18] Sida Shen:
Yes. So StarRocks is a high performance lakehouse query engine, right? It was basically forked from Doris in 2020. It's an Apache-licensed project that was donated to the Linux Foundation in 2023. Right? So basically in 2020, a handful of Doris maintainers, PMC members, we saw basically a challenge in the whole analytics field, right? Basically it was the performance challenge and the excessive pre-computation challenge, right? People just wanted faster queries, right? But they wanted fast queries also with join queries run on the fly, right? Everybody wants to run things on the fly, you know, for the better flexibility, right? For the simplicity of the pipeline, right? So they don't have to build that whole thing, you know, to duplicate their data or those pre-computation, denormalization pipelines.
So from the ground up, we built, you know, a bunch of features, right, to support these use cases. We built a cost-based optimizer, which is really essential, you know, for scalable, fast join performance, right? And we also basically vectorized all of our operators. You know, it's a C plus plus project, right? Doing that is reasonable, right? And with all of the SIMD optimizations, that, you know, gives us a great foundation of, you know, basically fast queries, fast OLAP queries, right? And we also built, you know, something like a primary key table, right? We have a row-level index. You know, that's for, like, sub-second level data upserts and deletes directly on top of columnar storage. Right? So that's the whole thing. And around 2022, we saw the rise of open formats, right, from Apache Iceberg. Basically, open table formats governing Parquet files. You can say it's something like that. Right?
So users have been more willing to take ownership of their own data. Right? And this is further augmented by, you know, the recent rise of AI. Right? We don't know what kind of workloads you wanna run on your own data. And we don't think there is one tool that can take care of all of your data needs. Right? So now is the time for users to basically take ownership of their own data, to store things in an open format so they can, you know, plug in the tools that they want without actually duplicating their data to a proprietary system, which destroys their data governance.
So now we've been working on that, basically to port all of the great performance, the low latency, high concurrency, the customer-facing kind of workloads, and enable them on open formats, on the lakehouse, on your shared storage. So basically, that's the rundown of the story and the things we've built. We can dig into, you know, each one of them as we go.
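To make the primary key table idea concrete, here is a minimal, hypothetical sketch, not taken from the episode: it assumes a StarRocks FE reachable over the MySQL wire protocol on the usual default port 9030 and uses pymysql; the table name is made up and exact DDL options vary by StarRocks version.

```python
# Hypothetical sketch: StarRocks speaks the MySQL wire protocol, so a plain
# MySQL client such as pymysql can create a primary key table and upsert into
# it. Host, port, and DDL options are assumptions; adjust for your deployment.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS demo")
cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.orders (
        order_id BIGINT,
        status   VARCHAR(32),
        amount   DOUBLE
    )
    PRIMARY KEY (order_id)           -- primary key table: row-level upserts and deletes
    DISTRIBUTED BY HASH (order_id)
""")

# On a primary key table, inserting a row whose key already exists acts as an
# upsert, so changes become queryable almost immediately.
cur.execute("INSERT INTO demo.orders VALUES (1001, 'shipped', 25.0)")
cur.execute("INSERT INTO demo.orders VALUES (1001, 'delivered', 25.0)")  # overwrites status
conn.commit()

cur.execute("SELECT status, amount FROM demo.orders WHERE order_id = 1001")
print(cur.fetchone())   # expected: ('delivered', 25.0)
conn.close()
```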
[00:05:18] Tobias Macey:
As far as the overall market and ecosystem for query engines, data compute engines, it has exploded over the past few years. And even before that, there were numerous options on the market. As you mentioned, there is currently a bit of a separation between lakehouse engines in the vein of things like Trino and Dremio and even Databricks. And then there's the data warehouse engines such as Redshift and BigQuery and Snowflake. And then we've got OLAP engines, which are focused on high throughput, high speed, for things like ClickHouse.
And it seems like what you're building with StarRocks is a little bit of a hybrid of all of those. And I'm wondering if you can talk to some of the ways that you think about the differentiating factors from StarRocks versus the broad competitive landscape?
[00:06:12] Sida Shen:
Yeah. Sure. I think let's start with the last thing you mentioned. So basically, like the ClickHouses, like your Druids. Right? Those, like, real-time OLAP databases that users use a lot for their, like, internal real-time dashboards or, you know, customer-facing workloads, right? One of the biggest challenges for those systems is that when they designed the system, they didn't really have on-the-fly joins in mind, you know, when they designed the system, right? And that's a big problem because the database management systems, you know, we know today were developed in, like, the 1970s, right?
Relational databases have been the best practice since then. And somehow, like, just five, ten years ago, we just forgot about it, right? And said, okay, one big flat table is the best practice, and it's really not. It probably solved your performance challenges, but it really introduced a lot more, like, challenges. Right? Firstly, you have to, you know, build denormalized tables. Right? You're basically duplicating your data, basically easily 10x-ing your storage. And, also, you're building excessive denormalization pipelines. Right? Doing that for batch, you know, like, using Spark, that's probably okay. But doing that in real time, right, you need to introduce stream processing. You need to introduce incremental compute, you know, to catch up with the data freshness requirement. Right? That's a whole other proprietary set of skills you have to have. Right? And that's really expensive. You need, like, somebody special that actually codes, you know, to actually do that. Right? Build your, like, stream processing pipeline. Right? And also, denormalization locks your data in a single-view kind of format. Right? Basically, let's think about it: you have a smaller left table. Right? And you have, like, a bigger right table. Right? You have, basically, a dimension table and a fact table. And one day you want to update something, you know, on your dimension table, right? Just doing that one column, not that many rows, is okay. Right? But imagine, like, with denormalization, you have a big flat table underneath. Right? With the number of rows basically equal to your fact table, and you still want to update that one column, guess what? You're stuck at data backfilling for, like, the next, like, three days.
And this is actually a huge problem, and we have a user that operates on petabytes of data. They're building a metrics platform. With a metrics platform, you should be able to add metrics easily, to change metrics easily, right? Because they were on Apache Druid, they couldn't do joins. They had to do denormalization. So each schema change or each edit or, you know, change in their metrics took, like, two to three days, you know, just because of data backfilling. And that's something that you don't want to have just to have, you know, good enough performance. And at StarRocks, we've been working on this problem since, like, you know, it was open sourced. We basically added so many features to the system that we make joins actually run fast enough, so you can ditch your denormalization pipeline. You don't have to rely on your denormalization pipeline. Denormalization should never be a default, even for OLAP systems, even for customer-facing analytics. It should be something that you build on demand, right, to fix those crazy workloads.
Yeah. So that's basically the comparison with the real-time OLAP databases. And the next one I wanna talk about is basically the Trinos, the Prestos, right? The Dremios, those kind of, you know, those query engines that work on top of a shared storage, on a data lake, right? We saw the data lakehouse kind of conversation start, you know, when Databricks introduced Delta, when Netflix introduced Apache Iceberg. Right now, we can do a lot more, you know, basically on top of Parquet files. Before, with Hive, everything needed to touch disk. You know, you want to do schema evolution, right? That needs to touch disk. Everything really relies on HDFS, right? And HDFS is really not that fast, right? So that created a lot of problems. And that's, you know, what gave birth to projects like Apache Iceberg, Apache Hudi, and Delta Lake. Right? Basically, we're adding data warehouse-like features, you know, to Parquet files, to open formats. So why are we doing that? It's because now we can run all sorts of workloads, you know, directly on top of one single source of truth data. Right? So, basically, you can introduce all sorts of tools and different kinds of workloads, like, not having to move data around, not having to duplicate your data into 15 different copies and try to, like, synchronize all of them, and that's just impossible to do.
Right? But one of the challenges we see, you know, in the field is that although, you know, the storage is ready for a lot of the dashboard-like workloads, a lot of the query engines that we have today are not really built for this kind of workload. For example, let's take Trino. Right? Trino is really known, you know, as an engine that really can connect to all kinds of data sources, right, and get okay performance. You can connect to MySQL and you could still do predicate pushdown. Right? You can still get that okay performance, but the engine itself is not 100% optimized for the best possible performance for one data source such as Apache Iceberg. Right? So it's very difficult to scale those systems, you know, to actually take over, you know, the kind of crazy workloads we run with, like, real-time data warehouses, OLAP databases.
Right? And that's why StarRocks exists. Right? StarRocks really bridges that gap. You know, we really try to bring that data warehouse-like performance directly to your lakehouse. Right? So you can run those, like, customer-facing workloads. You could run those, like, dashboards with thousands of people refreshing at the same time. Right? That kind of concurrency, right, we can handle. So we basically complete the puzzle, you know, for the lakehouse kind of paradigm. Right? And this is the second one, I think, for the query engines. Now we have the other, you know, kind of cloud-based, you know, data warehouses, right? You think of Snowflake, you think of Databricks, and you think of maybe even Athena, Redshift, and BigQuery.
And those are great projects and different in their own ways. Right? So what we do is we compensate for, you know, what they don't do. Basically, it's low latency, high concurrency. Right? Those systems are really not designed to scale to that level, you know, to scale to, like, sub-second level latency. Right? So we basically complete that part of the puzzle.
[00:13:00] Tobias Macey:
And before we dig a lot more into different use cases, it's probably worth exploring the architectural design of StarRocks and the ways that that architecture enables the different capabilities that you're focused on. So I'm wondering if you can just give a bit of a high level overview about how the system is implemented and, some of the core, I guess, architectural ilities of scalability, durability, etcetera, that are those core abilities that are foundational to the design and the premise of StarRocks.
[00:13:32] Sida Shen:
Yeah. Of course. So StarRocks itself is incredibly simple. I mean, the thing itself has two kinds of processes. So there are the FE, front-end nodes. Basically, they're responsible for, like, taking the queries, like, understanding the query, generating query plans, and doing metadata management. Right? And also we have the BE nodes, or in our shared data mode they're called compute nodes. Right? Those are basically, like, the workers, you know, the workers of the StarRocks cluster, right? They're responsible for actually executing the query plan, right, that the FE generates, and also storing data, or acting as the IO, you know, to read, to scan data from external shared storage, right? And architecture-wise, I think one of the most important things for StarRocks is, first of all, it's written in C plus plus, right? And we have vectorized all of the operators.
So, with SIMD, that really gives you the good, like, foundational performance for OLAP, right? Because you think about, like, from the bottom, right, columnar storage. That's basically storing things in a columnar format, in columns. You think of vectorization, that's basically, you know, doing things in memory, doing it in big chunks, doing it in columns. And also, you know, with SIMD, right, that's like processing multiple data points, you know, with a single CPU cycle, right? That's batch, right? That's in columns too, right? So basically, all that, the more batch you run, probably the faster you're going to go, right? So this whole architecture is really optimized for those, like, OLAP-style workloads. So basically think of, you know, aggregations and joins, you know, group bys and high-cardinality group bys and things. So this is the basic architecture, and StarRocks also, you know, on the planner side, it has a cost-based optimizer.
Basically, instead of being based on, like, heuristic rules, right, StarRocks actually collects statistics from your own data, fresh statistics, and uses that to compute the cost of each, like, candidate query plan, and uses that as a basis, you know, to basically come up with the best possible query plan, right? And that's really essential. That's probably the only way you can run high-cardinality aggregations and multi-table joins correctly, like, today, right? That's the state of the art right now. It's basically a cost-based optimizer. And that's basically, like, an overview of the planner side. And also on the compute side, we feature a massively parallel processing kind of compute architecture, right? And the heart of MPP, I think, is to, you know, utilize all of the nodes, right, when you do a computation, when you execute a query. I think the heart of it is basically we're able to shuffle data between nodes in memory while the query is running. Right? So the cool thing we can do with that is, basically, think about a two-phase high-cardinality aggregation. Right? We can do multi-phase aggregation because, you know, we can actually, like, shuffle the data between nodes, like, in memory, like, while the query is still running. And we can do shuffle joins. That's basically the only way, you know, to do, like, scalable big-table-to-big-table joins right now. So that shuffling service is essential, you know, in the architecture.
Yeah. So that's basically a rundown, you know, of the architecture. We can dig into, you know, each one, if you're interested. And you talked about, you know, like, the foundation of the system, right? So the scalability or the whatever-ability. If I want to conclude the ilities, is that how you pronounce it? Ilities, right? It would be, I think, first, scalability. StarRocks is really built to scale vertically and scale horizontally without losing much performance, right, to cater to the thousands of QPS of concurrent OLAP queries while also maintaining sub-second performance.
And that's the scalability part. And the second one I wanna say is predictability and reliability. And this is really we have a really big focus on those, like, customer-facing kind of workloads. Right? In customer facing, like, having fast queries, yeah, it's important, but most importantly, your query performance is predictable. Right? One slow query, one very slow query, is a lot worse than all of the queries being, like, medium-level kind of slow. Right? So we built so many kinds of features and functionality optimizations to ensure that your queries are not only fast, but they're also predictable and reliable, right? Even if your infrastructure fails, you know, underneath, some of your EC2 instances are gone, we can still maintain that kind of performance. We can dive into that a little later in a bit. Right? And this, the second one, I think, is predictability and reliability.
And the last one I wanna say is governability. I don't even know if that's a word, but governability, let's go with that. Right? Basically, okay, we talked about a lot of, like, scalability, performance, being fast, being predictable, right? But now we want to not run them in a proprietary format. We want to enable them basically in Parquet files, in the Iceberg table format, in something that other engines are familiar with as well, right? To basically allow you to have a single source of truth of data, right, while also having those great things, you know, about StarRocks, to basically build those integrations on top of open formats, to work with our partners, right, to make that happen, to bring the governability, if that's a word, to our users.
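As a rough illustration of that governability point, the sketch below registers an Iceberg catalog and runs an OLAP-style aggregation against it through the same SQL interface used for StarRocks-managed tables. The catalog property names, metastore URI, and table names are assumptions based on typical StarRocks Iceberg catalog configurations, not details from the conversation.

```python
# Rough sketch (assumed property names and endpoints): register an Iceberg
# catalog once, then query its tables with plain SQL like any other table.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL CATALOG iceberg_lake
    PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "hive",
        "hive.metastore.uris" = "thrift://metastore:9083"
    )
""")

# An OLAP-style aggregation pushed directly against the Iceberg data.
cur.execute("""
    SELECT region, count(*) AS orders, sum(amount) AS revenue
    FROM iceberg_lake.sales.orders
    WHERE order_date >= '2024-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for row in cur.fetchall():
    print(row)
conn.close()
```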
[00:19:29] Tobias Macey:
One of the challenges that always comes up when you're dealing with distributed data systems, where you have databases that can scale the storage across multiple nodes, is the question of how do you deal with partitioning, sharding, replication? How does that factor into the ways that you think about structuring your queries, structuring your tables, etcetera? And I'm wondering, from a schema design perspective, how does that influence the ways that people think about what data to store, how to store the data, how to query the data when they're working with StarRocks, and how much of that does StarRocks take over so that you don't have to think about it as much? I'm thinking in particular about ClickHouse, where you have to think about, at the time of table creation, what type of replication am I doing for this table? Because if I want to change it down the road, then I effectively have to replicate all of the data to a new table, and the same if I wanna change the sharding key or something like that. Yeah.
[00:20:25] Sida Shen:
Yeah. So, for that, I think that's more like a scalability kind of question. Right? So when you scale with ClickHouse, you have to, like, manually shard. Right? For StarRocks, I mean, this is really not a problem for shared data because, like, all of the data is persisted in your S3 buckets, right? So when you scale, you know, you only scale the compute. If you don't think about the cache, you know, it's stateless. It doesn't persist any data. But for shared nothing, you know, the sharding and all of the data distribution are all automatic. So you don't really have to, you know, worry about basically redistributing your data manually, right, when you do want to scale your StarRocks shared nothing cluster. Right? And in terms of, like, table design, I think this is a multilevel question too. So, yeah, we do have partitioning, and partitioning now is automatic.
Right? So that's a lot less that you would have to do, you know, compared to other kinds of systems. Right? And also, I think this is more general with OLAP. Right? I think with OLAP right now, we run everything, you know, in the columnar format, as I mentioned, you know, on the storage side, on the memory side, right? And this is actually more performant than, you know, relying on indexes. So actually, including ClickHouse, OLAP systems today don't really rely on indexes, you know, to have the baseline performance, right? And indexes are basically used for your kind of specialized workloads, right? You want to do, like, a massive count distinct with, like, trillions of rows. And that's when you want to build, you know, some kind of index, a bitmap index, right, to solve that kind of problem, secondary indexes, right?
So creating tables in OLAP, not only with the effort we've done, but the effort that's been done before, even before us, right, with vectorization, right, table creation, or the usage of OLAP, has become, like, so much easier, right, over the past, like, five, ten years. Right? And we're on a mission, you know, to make it even more accessible for all of the users out there.
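A hypothetical example of that table-design story: partitions derived from an expression instead of managed by hand, hash distribution left to the engine, and a bitmap index added only for a specialized filter. The DDL details and names are assumptions and differ across StarRocks versions.

```python
# Hypothetical DDL sketch: expression-based partitioning and engine-managed
# distribution, plus an optional bitmap index for one selective filter.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        event_time DATETIME,
        user_id    BIGINT,
        country    VARCHAR(8),
        payload    STRING
    )
    DUPLICATE KEY (event_time, user_id)
    PARTITION BY date_trunc('day', event_time)   -- partitions created automatically per day
    DISTRIBUTED BY HASH (user_id)                -- sharding handled by the engine
""")

# Secondary index only for a specialized, highly selective predicate;
# baseline performance comes from columnar storage and vectorization.
cur.execute("CREATE INDEX idx_country ON demo.events (country) USING BITMAP")
conn.close()
```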
[00:22:37] Tobias Macey:
Another aspect of the architecture is that, by virtue of the front-end nodes being a separate deployable unit from the compute nodes, it introduces the potential for network latency. And even if you're deploying them collocated on the same hardware, it at least introduces process boundaries. Whereas other database engines may be more fully vertically integrated. Obviously, that helps with the scalability story and the maintainability, deployability, all of those things. And I'm wondering what you see as the trade-offs that are involved in that architectural decision of being able to separate it into tiered layers, both in terms of the negative trade-offs, but also some of the benefits that it enables. And I'm thinking in particular about the shared data layer capabilities of being able to query against the lakehouse formats while also being able to have that shared nothing layer of the data stored on disk.
[00:23:36] Sida Shen:
Yep. So, let's go with the good things first. So first is the languages. Right? We use different languages for different kinds of purposes. Right? And the FE and the BE, you know, they have really big differences in purpose. Right? The FE basically, you know, manages metadata and generates query plans. Right? It's really not that compute heavy. Actually, not compute heavy at all. Right? It's a lot of logic. And the BE is more, a lot of, like, parallel processing, really heavy on the compute. So for the two kinds of purposes, the FE is much more suitable to do in Java, and the BE is much more suitable to do in C plus plus. Right? By separating the two, basically, we can use the language that's more suitable, you know, for that particular kind of use case. Right? And this is basically the biggest benefit that's out there. And talking about trade-offs, yes, there are trade-offs. Network is a concern, but there's really not that much data being exchanged, you know, between your FE and your BE, or CN, nodes. Right? That's just the query plan, the physical plan. Right? And maybe returning the data, but, you know, we do that through the BE too. So it's really not that much data to exchange. Right? I think there are two limitations to this kind of design. So the first is the two being written in different languages. Right? So if, let's say, we have a little function that we implement in the BE, right, we write in C plus plus, right? We cannot really call that function when only the FE is up. Right? So we have to, like, have the whole cluster up.
So, you know, this creates a little bit of trouble, you know, for us on the development side. Right? We have to think about, you know, we do actually have two kinds of languages, and what do we put where. Right? So this is basically, like, the first kind of limitation. And second, this is really not that good if your cluster is very small. So, basically, FE and BE are two kinds of processes. Right? And we cannot really handle, you know, the resource management between the two. Right? We have to, like, completely depend on Linux to do that.
It doesn't really do that that well. Right? So if you want to deploy StarRocks in production, what you will want to do is to have dedicated machines for your FE nodes. Right? But when you only need, like, two, three BE nodes, sometimes one BE node is enough for your deployment, and you want high availability for your FE nodes, right, you want three FE nodes and one BE node, then that's, you know, a waste of resources when your cluster is very small. Right? But, you know, StarRocks is really built for distributed compute, when you have, you know, a really big dataset and really big concurrency.
Right? So that is actually not that big of a problem too. Right? I hope that answer your question.
[00:26:49] Tobias Macey:
Yeah. And then now digging into the use cases that StarRocks enables, particularly around that tiering of the lakehouse shared data and the high-speed OLAP shared nothing, you also have the intermediary bit of being able to cache the lakehouse data. I'm wondering if you can talk to some of the ways that that changes the data platform architecture and the overall usage patterns that are modified as a result of that capability, versus the current landscape of I either have my lakehouse or I have my OLAP engine, but I have to do some extra work to be able to join them together, or maybe I can use an Iceberg table in, like, my ClickHouse, but it's still gonna be really slow, and just some of the ways that having that unification of the stack changes the overall approach the teams take to the design of their data systems?
[00:27:49] Sida Shen:
Yeah. That's a fantastic question. So, yeah, we do have, just a little bit of background information, right, for the listeners that are not familiar with the StarRocks architecture. So we do have our own table format and we do have our own file format. That's basically our internal tables. Basically, when we talk about shared nothing or shared data, we are all talking about, you know, the proprietary, crazy kind of format that we built just for the best possible performance. Right? You have your low-cardinality, you know, kind of, like, storage-level optimizations. Right? And you have all sorts of crazy indexes you can use. Right? And the format itself is more optimized for low latency, high concurrency. Right? And that's basically the shared nothing and shared data kind of thing you're talking about. And also we have Apache Iceberg, right? And Apache Hudi and Delta Lake and Apache Paimon. Right? Those are open table formats.
And they govern ORC and Parquet files, and those are the open file formats. Right? And so how does the whole thing work? So basically, the underlying thing is we want to preserve all of the things that are good about open table formats. Basically, it's single source of truth data, right, in the open format, right? So with this as a requirement, how can we make the performance faster, right? So actually, with StarRocks, you can create materialized views, right, from an Apache Iceberg table. And what it does is that a materialized view is basically a managed StarRocks table in StarRocks' own crazy, own really fast format. Right? So what you can do is create a materialized view. Right? And you can do some kind of data transformation too if it's needed. Right? And that becomes, you know, StarRocks' own kind of format. And also the whole process is managed. So you don't really have to, you know, think about refresh. You don't really have to think about, you know, is the data in sync. You know, if you configure it correctly, it's gonna work, and you don't really have to worry about that. Right? And, also, the materialized view features automatic query rewrite. So, basically, if you do something like a pre-aggregation or a denormalization with a StarRocks materialized view, you don't really have to change your SQL query to benefit from, you know, the materialized view. StarRocks' cost-based optimizer can direct your query, you know, to the most appropriate materialized view, right, automatically, so you don't have to worry about that. Right? So that's basically the way we figured out, you know, how to enable this kind of, like, proprietary-level, like, low latency, high concurrency, still on the single source of truth data in open formats.
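A hedged sketch of that materialized view pattern: an asynchronously refreshed, StarRocks-managed aggregate built over an Iceberg table, with the application continuing to query the Iceberg table and relying on the optimizer's query rewrite. The refresh clause, catalog, and table names are assumptions rather than details given in the episode.

```python
# Hedged sketch: an async materialized view in StarRocks' managed format built
# over an Iceberg table; dashboards keep querying the Iceberg table and the
# cost-based optimizer may rewrite matching queries to the MV.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW demo.daily_revenue_mv
    REFRESH ASYNC EVERY (INTERVAL 1 HOUR)        -- refresh is managed by StarRocks
    AS
    SELECT date_trunc('day', order_date) AS day, region, sum(amount) AS revenue
    FROM iceberg_lake.sales.orders
    GROUP BY date_trunc('day', order_date), region
""")

# The application query still targets the Iceberg table; when the MV is fresh,
# query rewrite can transparently serve it from the StarRocks-managed MV.
cur.execute("""
    SELECT date_trunc('day', order_date) AS day, sum(amount) AS revenue
    FROM iceberg_lake.sales.orders
    WHERE region = 'EMEA'
    GROUP BY date_trunc('day', order_date)
""")
print(cur.fetchall())
conn.close()
```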
[00:30:42] Tobias Macey:
And by virtue of being able to use one interface to query both your shared nothing data sources, where you're guaranteed to have high throughput, high speed, and your lakehouse table formats. Obviously, being able to join across them is gonna introduce slowdown, but it simplifies the logical approach to how to actually work across that data, as well as giving you the ability to, for instance, materialize or build new tables off of the lakehouse data into the shared nothing architecture layer so that you can consistently have high speed through that. So I'm thinking in terms of, like, the medallion architecture style thing pioneered by Databricks, or the typical dbt flow of staging, intermediate, mart, where maybe you stage all of your data as Iceberg files, but then in the intermediate or mart layer, you're materializing those into the StarRocks shared nothing layer. I'm just wondering what you're seeing as some of the ways that people are approaching that style of workflow and some of the benefits that they're seeing as a result.
[00:31:49] Sida Shen:
Yeah. So yeah. That's a great question. And, actually, materialized views are used basically by, like, almost all of our users right now. Right? So with direct query, we can get great performance. You can think about, like, around three to five times, you know, compared to Trino, right, cache off for cache off. But materialized views just take this to, you know, a whole other level. Right? This is basically a painless way for you to add denormalization pipelines or pre-aggregation pipelines on demand. Right? The way I say on demand is because of the query rewrite capability. Right? So you really don't have to worry about whether this query is still gonna work after I build this pipeline. Let's think about it: you build something with Spark, right? Build a denormalized table with Spark.
You have to change your SQL script or reconfigure your entire dashboard, you know, for that thing to take effect, right? And that was really painful for our users, you know, before they moved to StarRocks, because you basically have to figure out all of the pre-processing pipelines before you actually build your application and run your application, because you have to actually, like, go bother your users, you know, to change their SQL scripts. And that's nothing that anyone wants to do. Right? Yeah. So with this, you can actually, you know, just build your application, right, and figure out the performance afterward and add denormalization pipelines afterward. And you end up adding so many fewer pre-computation pipelines. Right? And also, you end up going to production a lot faster, you know, with the query rewrite capability of materialized views.
This is, you know, basically the main way our users are blending our proprietary, optimized format with an open format like Apache Iceberg. Right? And there are actually other use cases too. You know, it doesn't only help with queries. Our proprietary format also has a primary key table, right, which you're probably familiar with. It has a record-level index, right, that's able to accelerate your data freshness to, like, sub ten seconds, you know, directly on top of columnar storage. And that's, you know, that has something to do with memory. So that's not something that Apache Iceberg can do. So we actually have a user that uses the primary key table as the caching layer when they basically ingest the data in. So it first lands in the primary key table. So you get that, you know, ten-second data freshness.
And they built a whole mechanism to basically sync the data from their primary key table to Apache Iceberg, like, every hour. Right? Delete and do a truncate of the table, you know, on top, and make that whole thing, like, an atomic operation. So that's a crazy workload. Right? So they actually presented this whole thing last year at the Iceberg Summit, as Tencent Games. So they were basically able to get sub-ten-second data freshness on top of, like, petabytes of data. Right? And single source of truth still on Apache Iceberg, and running that with materialized views, running that with, like, sub-second query performance. It's just some crazy numbers if you actually combine the two correctly. Right? This is probably, like, one of the most exciting ways you can use StarRocks, you know, I've seen.
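For illustration only, a very simplified version of the landing-then-sync pattern described here: fresh writes land in a StarRocks primary key table, and a scheduled job copies them into an Iceberg table and truncates the landing table. The real Tencent Games pipeline handles atomicity and failure cases that this sketch, with its assumed table names, does not.

```python
# Illustration only: an hourly job that copies data from a hot StarRocks
# primary key landing table into an Iceberg table, then clears the landing
# table. Names are made up; error handling and atomicity are simplified.
import pymysql

def hourly_sync() -> None:
    conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
    cur = conn.cursor()
    try:
        # Move the accumulated fresh rows into the single source of truth on Iceberg.
        cur.execute("""
            INSERT INTO iceberg_lake.analytics.events
            SELECT * FROM demo.events_landing
        """)
        # Clear the landing table so it only holds the newest, freshest data.
        cur.execute("TRUNCATE TABLE demo.events_landing")
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    hourly_sync()   # in practice triggered by cron or an orchestrator every hour
```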
[00:35:11] Tobias Macey:
And that was gonna be my next question. Was the reverse of landing your data in StarRocks for high write throughput and being able to do read-after-write querying in those other applications. So either using it for an application storage layer, maybe not necessarily as the primary database, but for being able to push analytical events, similar to the use cases that ClickHouse is often deployed for, and then being able to periodically flush that data down to Iceberg so that you maintain that as a raw storage table, but you're also still able to use it for being able to power interactive use cases as the data gets rewritten?
[00:35:51] Sida Shen:
That's a really exciting thing. I mean, that's not a problem that I think I think open table formats are looking to solve. Right? And because that's a lot of that is engine specific, you know, a lot of that depends on memory. Right? So, you know, that's one really innovative way you can solve that problem, have that data freshness.
[00:36:08] Tobias Macey:
In terms of the lakehouse support, you mentioned Iceberg, Delta, Hudi. There are a large and growing number of table formats that are available. Obviously, they all have different use cases that they prioritize. And I'm wondering how the work that you're doing in StarRocks focuses on either subsets of that functionality or just the overall problem matrix of supporting all of those tables and being able to support all of the features, versus what are the core elements of those formats that you care most about? And particularly as people ask you to add more and more tables to the list.
[00:36:46] Sida Shen:
Yeah. That's a great question. That's a great question. So, for all of the table formats, I think almost all of them, you know, even just Iceberg, Delta, Paimon, and Hudi, and even Lance, right, they were all, like, born for similar reasons. They're all solving this very similar problem, which is the Hive table just doesn't work anymore with the scale we're operating on, with the frequency that we want to update our data. Right? The frequency we want to add new columns, to delete columns, you know, to do schema evolution. Right? It's not keeping up. And I think they all solve very, very similar problems, but they started at different places. And I do see a lot of the table formats converging, you know, and having a lot more shared functionality than we saw, like, two, three years ago. And yesterday at the Iceberg Summit, a lot of the features that were introduced, like deletion vectors, like, let's say, geospatial kind of features, they're shared between Delta and Iceberg.
So I imagine actually the opposite. So going forward, we should be able to see a lot fewer proprietary features, you know, in those table formats. And also what helps more is that we have innovative projects right now, such as Delta Uniform, in Delta 2.0, and Apache XTable. That, basically, you know, is one copy of Parquet files, and it generates three copies of the metadata. Metadata is a lot lighter, you know, so they can have, you know, three copies of metadata and one copy of the data files. So in that way, you know, we should be able to work with more of a unified interface for our query engines. That's tomorrow. That's something that we really want to come, like, actually tomorrow.
[00:38:38] Tobias Macey:
And then for cases where teams already have some investment in one of these other query engines, whether they're using Trino, Dremio, etcetera, for their lakehouse use case, or they're using Snowflake, etcetera, for their warehousing, or they're using ClickHouse or Druid for OLAP, or maybe they're using some combination of all of those. What are the situations where you typically see teams adopting StarRocks as a supplementary tool versus a replacement of any or all of those?
[00:39:10] Sida Shen:
A supplementary tool. I think we see a lot more of that, you know, with query engines. Right? Because the heart of, you know, like, the lakehouse is basically one giant copy of data and 15,000 different tools on top of it using that one copy of data without moving the data. And that really sparks, you know, innovation on our side because that really drives specialized tools. And StarRocks is that type of specialized tool on the lakehouse. Right? And we're probably the only one out there that is really hyper focused on high concurrency, low latency, reliability, you know, those kinds of, like, customer-facing workloads. So we do see a whole ton of StarRocks running side by side with Apache Spark.
So Spark does a lot of the business transformation, does a lot of, like, the compaction. Right? And we handle a lot of, like, the interactive queries, and we handle a lot of, like, the high concurrency, like, the dashboards and the reports. And also, you know, like, customer-facing applications where they're directly serving all of those queries to their customers and running everything on the fly. So that is probably the biggest. And, also, we do see a lot of ClickHouse users, but that's more often a direct replacement. And they, you know, they just look at their, like, 10,000 denormalized tables and get frustrated, and look at their bill, and they, you know, they cry. Right? And they wanna find something that can run joins on the fly, and, you know, they find us. And later, they discover, you know, even more reasons, you know, to stay on StarRocks, because of the scalability and because of the data upserts, the real-time kind of, you know, real-time data upserts. You can actually, you know, have your data change. What a surprise. Right?
Like, in real time. So they end up, like, slowly, slowly, like, replacing the whole system. But denormalization is basically most of the reason that a lot of the ClickHouse users choose StarRocks in the first place.
[00:41:11] Tobias Macey:
Digging more into the ClickHouse comparison, one of the things that gives them a built-in advantage at the moment is the overall ecosystem investment around it, where they've been around for a while, they have gained a good reputation. There are a lot of systems that presume ClickHouse as their storage layer. I'm thinking even in particular of some very new systems, such as AI gateways, that rely on ClickHouse as the storage layer for storing some of their user interaction data, telemetry data, etcetera. And I'm wondering, in terms of that piece of the puzzle for StarRocks, how you're approaching the ease of integration, ease of adoption from a protocol and ecosystem layer.
[00:41:50] Sida Shen:
Protocol and ecosystem, we definitely are spending a lot more of our effort on those. But I think, in our experience, that's really not a huge blocker for a lot of the users, because a lot of the things they're experiencing when they're trying to go to production, when they're trying to scale, those problems are much more difficult to solve than a lot of the, you know, higher-level kind of issues. And StarRocks is not really exactly, yeah, we do lack, when compared to ClickHouse, you know, some functions. But we are working with the community, you know, to make StarRocks a much more mature product.
[00:42:32] Tobias Macey:
And digging a bit further into my brief mention just now of the AI ecosystem, obviously, that's another area that has been driving a lot of development in terms of completely new data engines as well as feature additions to existing ones. And I know that StarRocks has an array data type which can be used for storing vectors. And I'm just curious what the current stance is of StarRocks in that broad ecosystem of AI applications and being able to enable them and deal with vector embeddings, etcetera.
[00:43:04] Sida Shen:
Yes. So I think it's good for me to give, like, a general picture and then, like, some feature explanations. Right? So first, forgive me for going back to lakehouse again, but this is really lakehouse. So, lakehouse, they did that a long time ago, right, since 2017, 2018. It didn't really become, like, that popular, that big, until just a few years ago. Right? The reason for that is I think AI is really a big push, you know, for the lakehouse adoption, or the lakehouse kind of, like, conversation. Because, like, with a data warehouse, with, you know, just think about, like, data warehouse kind of workloads, five, ten years ago, it was probably possible that one system could take care of, you know, all of your needs. Right? But with AI just right around the corner, nobody knows what's gonna come tomorrow. Right? What kind of new tools am I gonna have to use the next day?
So users taking ownership of their own data, storing their data in Parquet files instead of, let's say, the StarRocks file format, right, has become a lot more important in the past five years just because of, you know, the boom of AI, machine learning kind of workloads. And that's, I think, a lot of the reason why we started to build this whole, like, lakehouse integration. Right? Because we do want our users to have their own data, to own their own data, and to have the freedom, you know, to choose what kind of tools they can use tomorrow. So that's more, like, the lakehouse kind of conversation. And what makes the lakehouse even more important today, I think, is the birth of agents, analytical agents. Agents work a lot faster. They don't really have to sit there and think. You know, they can go through, like, 10,000 tokens in, like, a few seconds. Right? Having a universal view of your entire dataset inside of your organization is something that's very important for AI agents. So you want to have that very strong single source of truth kind of data layer inside of your organization to better leverage, you know, AI agents. Right? All sorts of agents, not only, let's say, analytical types of agents. And that, I think, you know, gives it an even bigger push. That's the whole, like, lakehouse thing, right? I think, just talking about agents, you know, there are actually a lot of users in our community already building analytical agents with StarRocks. I can't really say who yet because they're still figuring it out, in the early phase. Right? But we do see a lot of very exciting new projects coming. I think StarRocks really pairs well with agents because with agents, you have to have agent-scale kind of concurrency. Nobody knows, like, how big that's gonna be. And low latency is gonna become, like, an even bigger kind of requirement because they work just so much faster than I do. And also you have automation in your pipeline. You have less human in the loop. You need your query engine to be able to self-optimize, right, to not have that many knobs you have to turn. Right? You have to be able to learn from the queries that were previously executed to constantly evolve and to make sure that your query plans are always, you know, the most optimal. And we do have that feature built in, by the way. And I think the characteristics of StarRocks are really good for solving a lot of the agent problems that we see in the industry today. Right? And so that's basically, you know, the bigger thing, right? The more forward-looking thing. And feature-wise, you know, yes, Tencent Games did contribute the vector index functionality.
Right? So you can use StarRocks to basically do RAG, right, or to do similarity search like Tencent Games. Tencent, I think they do, like, a reverse image search. And you can do, like, a hybrid search, right, because we are so good at SQL queries. Right? You can do a lot more, you know, with your vector search with StarRocks. And that feature is available if you want to try that out today. And also, to better incorporate into the whole AI ecosystem, we really strengthened our Python integration. Right now we have Arrow support, now we have a Python UDF, and we're gonna have a lot more, you know, going into the future. Does that answer your question?
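To show roughly what that hybrid search capability could look like, here is a speculative sketch: embeddings stored in an ARRAY&lt;FLOAT&gt; column alongside regular columns, queried with a similarity function combined with an ordinary SQL filter. The function name approx_cosine_similarity and the schema are assumptions made to illustrate the idea; check the StarRocks documentation for the actual vector search syntax.

```python
# Speculative sketch of hybrid search: embeddings in an ARRAY<FLOAT> column
# combined with a normal SQL predicate. The approx_cosine_similarity function
# name is an assumption about the vector search feature, not confirmed syntax.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.docs (
        doc_id    BIGINT,
        category  VARCHAR(32),
        embedding ARRAY<FLOAT>
    )
    PRIMARY KEY (doc_id)
    DISTRIBUTED BY HASH (doc_id)
""")

query_vec = "[0.12, 0.80, 0.33, 0.05]"  # embedding of the user's query, produced by any model

# Rank by vector similarity while filtering with plain SQL in the same statement.
cur.execute(f"""
    SELECT doc_id,
           approx_cosine_similarity(embedding, {query_vec}) AS score
    FROM demo.docs
    WHERE category = 'support_ticket'
    ORDER BY score DESC
    LIMIT 10
""")
print(cur.fetchall())
conn.close()
```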
[00:47:26] Tobias Macey:
Absolutely. Yeah. Definitely a lot of exciting stuff to dig into. And in your experience of working on this project, working in the community and with your customers at CelerData, what are some of the most interesting or innovative or unexpected ways that you've seen StarRocks applied?
[00:47:43] Sida Shen:
Okay. So, let's think, you know, the first one has to be Tencent Games, right, with their crazy kind of, you know, sub-ten-second data freshness directly on top of Apache Iceberg, with petabytes of data, sub-second latency. Right? That is a very innovative way, you know, of utilizing our format with open formats, as a lakehouse innovator. And something also that is, I think, a lot more interesting is that a StarRocks-based, or StarRocks-powered, lakehouse, StarRocks and Apache Iceberg, are actually solving those problems nobody believed were possible. Like customer-facing analytics, right? That's something that requires very low latency, high concurrency, and stability of your latency, reliability of your latency. Right? We have users that use StarRocks and Apache Iceberg in production, serving their customers, their paying customers. Right? So the first example can be, let's think about Herdwatch.
Herdwatch is actually a customer. They build, like, customer-facing dashboards for cows, for, like, livestock. They have, like, tens of millions of livestock they have to monitor for their customers, which are, you know, farms. And they used StarRocks and Apache Iceberg to basically replace Athena, right, and brought down query latency from, like, tens of seconds to, like, sub-second. TRM Labs is another one I've mentioned too. Right? And we have a lot more use cases, such as customer facing, such as Pinterest. Pinterest uses us to, you know, do that kind of, you think of, like, YouTube analytics page. Right? I'm sure you're familiar with that. So those are for, like, content creators and for advertisers. You'll see in real time how their ads are running, right, to make real-time decisions on those. Right? With that, and with Coinbase, customer facing, just a bunch more. We have a bunch more case studies on the starrocks.io website. You can check it out. And there's one thing I wanna mention, unexpected.
I did just talk to a user last week. We have that primary key table that can do real-time data upserts, but it's not a TP database. That guy is not doing OLAP. That guy's using StarRocks as a transactional database, a transactional database only, to serve his operational workload. It's been working fine for, like, a year and a half. I just heard about that, like, last week. Yeah. That's probably the most, like, unexpected way, you know, people use it. You probably shouldn't, but, you know, it's good to see. It's always fun to see.
[00:50:24] Tobias Macey:
And in your experience of working on this technology, working with this community, and in this problem domain, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:50:35] Sida Shen:
I think challenge-wise, we've been through so many difficult things. I mean, building our cost-based optimizer was a lot more difficult than we thought. It's basically an optimization problem. Basically, you know, we did need to work with the community a lot in the first, like, two years to make it, basically, it took, you know, a year and some change to confidently say that we can turn it on by default. And thank God we're open source, or else a lot of the features that we built today would not be possible without our initial adopters trying it out and showing us the crazy, stupid query plans that our CBO had generated.
Right? Without those, you know, that wouldn't be possible. And I think the biggest challenge that we see is basically a scale challenge. We do test our system really rigorously. Right? We think terabytes of data, that's big, 10 terabytes of data, that's big. But once we get put into production systems, 10 terabytes is not the data size, it's the scanned data size after pruning, right? And that's very difficult, right? With hundreds of StarRocks instances, each with, like, fifty-plus cores. And a lot of the problems do show up. You know, you see a lot more bottlenecks, you find new bottlenecks in your system, you know, when you push a system to that kind of scale, right? For example, when there's a data skew, right? Shuffling can be, you know, can be a huge challenge. And that's a problem we spent, like, almost half a year to solve, right? We didn't even think that was there, you know, in the first place. And that's also, you know, thanks to us being open source and our open source users. You know, we've been iterating for the past, like, five years to make this system, you know, as stable and as performant and as reliable as it is today.
[00:52:23] Tobias Macey:
For people who are interested in simplifying their data stack or they really care about speed and latency, what are the cases where StarRocks is the wrong choice?
[00:52:34] Sida Shen:
Yeah. We have to talk about that guide again. Yeah. Don't use it as a transactional database. There are better options out there. If you want to, you know, analyze your transactional database, read the CDC and use something like a, Debezium, right, to CDC it into StarRocks, and we support updates. So, you know, that works. Don't use it as a transactional database. That's probably not your best bet. And also for very large ETL kind of workloads that runs for, like, hours or days, you're better off probably using something like Spark, something that's designed for those kind of workloads, especially, you know, if you're on a data lake. You can just run the tools, you know, simultaneously on top of one one copy of data.
So yeah. So that's basically it.
[00:53:18] Tobias Macey:
And as you continue to build and iterate on the StarRocks engine and the ecosystem around it, what are some of the things you have planned for the near to medium term or any particular projects or problem areas or or capabilities that you're excited to explore?
[00:53:33] Sida Shen:
Yeah. So this year, we're really looking to to further stabilize, you know, the query performance, not only just for your for our own, like, shared data or shared nothing kind of architecture. Right? But more for like Apache Iceberg. Right? So how to deal with, unmaintained or less maintained iceberg metadata files, right, how to better read them parallelly, right, how to better cache them, right? How to make that metadata cache more reliable, right? And also how to make our own data cache more reliable. We've already built a lot of features just to solve that specific problem. And we're going to work with the community and to make those features more mature and more robust, right? So you can end up running the craziest kind of workloads on your open data, on your own data. And this is exactly what we want in the future, not only for your data kind of workload, but only also for your AI. So we take apart, you know, into that big evolution, making everything open. And that is really what we're looking for. Further power customer facing analytics on that open type of format.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey. And today, I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patterns. So, Sida, can you start by introducing yourself?
[00:00:59] Sida Shen:
Hey. Hello, everyone. This is Sida. I am a product manager at CelerData and also a StarRocks contributor.
[00:01:08] Tobias Macey:
So hello, everybody. Thank you for tuning in today. And do you remember how you first got started working in data?
[00:01:14] Sida Shen:
Yes. So actually, my whole career started with operational research and applied mathematics. So I was doing a lot of mathematical programming, and that leads naturally to machine learning and optimization. Right? So I was basically a data scientist, you know, back then. And I kind of realized that, as we move along, you're not really going to differentiate yourself with algorithms. Right? Data is the foundation of everything. Right? From smaller models to bigger models to LLMs right now. Right? Data is always gonna be your differentiator. How do you manage your data? How do you have good governance on top of your data? Right? That's why I got really interested in database systems.
And that's, you know, how I ended up joining StarRocks and started actually working on an OLAP database, StarRocks.
[00:02:07] Tobias Macey:
And so digging into StarRocks itself,
can you give a bit of an overview about what it is that you're building and some of the story behind how it got started and the objectives that you're focused on?
[00:02:18] Sida Shen:
Yes. So StarRocks is a high performance lakehouse query engine, right, that was forked from Doris in 2020. It's an Apache licensed project that was donated to the Linux Foundation in 2023. Right? So basically, in 2020, a handful of Doris maintainers and PMC members, we saw a challenge in the whole analytics field, right? Basically, it was the performance challenge and the excessive pre-computation challenge, right? People just wanted faster queries, right? But they wanted fast queries with join queries run on the fly, right? Everybody wants to run things on the fly, you know, for the better flexibility, right, for the simplicity of the pipeline, right, so they don't have to build that whole thing to duplicate their data, those pre-computation, denormalization pipelines.
So from the ground up, we built a bunch of features, right, to support these use cases. We built a cost based optimizer, which is really essential, you know, for scalable, fast join performance, right? And we also basically vectorized all of our operators. You know, it's a C++ project, right? Doing that is reasonable, right? And with all of the SIMD optimization, that gives us a great foundation of, you know, basically fast OLAP queries, right? And we also built something like a primary key table, right? We have a row level index. You know, that's for, like, sub second level data upserts and deletes directly on top of columnar storage. Right? So that's the whole thing. And around 2022, we saw the rise of open formats, right, from Apache Iceberg. Basically, open table formats governing Parquet files. You can say it's something like that. Right?
So users have become more willing to take ownership of their own data. Right? And this is augmented by, you know, the recent rise of AI. Right? We don't know what kind of workloads you'll wanna run on your own data. And we don't think there is one tool that can take care of all of your data needs. Right? So now is the time for users to take ownership of their own data, to store things in an open format so they can, you know, plug in the tools that they want without actually duplicating their data to a proprietary system, which destroys their data governance.
So now we've been working on that, basically to port all of that great performance, the low latency, high concurrency of those customer facing workloads, and enable them on open formats, on the lakehouse, on your shared storage. So basically, that's the rundown of the story and the things we've built. We can dig into, you know, each one of them as we go.
[00:05:18] Tobias Macey:
As far as the overall market and ecosystem for query engines and data compute engines, it has exploded over the past few years. And even before that, there were numerous options on the market. As you mentioned, there is currently a bit of a separation between lakehouse engines in the vein of things like Trino and Dremio and even Databricks. And then there's the data warehouse engines such as Redshift and BigQuery and Snowflake. And then we've got OLAP engines, which are focused on high throughput, high speed, things like ClickHouse.
And it seems like what you're building with StarRocks is a little bit of a hybrid of all of those. And I'm wondering if you can talk to some of the ways that you think about the differentiating factors of StarRocks versus the broad competitive landscape?
[00:06:12] Sida Shen:
Yeah. Sure. I think let's start with the last thing you mentioned. So basically, like ClickHouse, like your Druid, right? Those real time OLAP databases that users use a lot for their internal real time dashboards or, you know, customer facing workloads, right? One of the biggest challenges for those systems is that when they designed the system, they didn't really have on-the-fly joins in mind, right? And that's a big problem, because the database management systems we know today were developed back in, like, the 1970s, right?
Relational databases have been the best practice since then. And somehow, like, just five, ten years ago, we just forgot about it, right? And said, okay, one big flat table is the best practice, and it's really not. It probably solves your performance challenges, but it really introduces a lot more challenges. Right? Firstly, you have to build denormalized tables. Right? You're basically duplicating your data, easily 10x-ing your storage. And, also, you're building excessive denormalization pipelines. Right? Doing that for batch, you know, with, like, Spark, that's probably okay. But doing that in real time, right, you need to introduce stream processing, you need to introduce incremental compute, you know, to catch up with the data freshness requirement. Right? That's a whole other specialized skill set you have to have. Right? And that's really expensive. You need somebody special that actually codes, you know, to build your stream processing pipeline. Right? And also, denormalization locks your data into a single view kind of format. Right? Basically, let's think about it: you have a smaller left table, right, and you have a bigger right table. Right? You have, basically, a dimension table and a fact table. And one day you want to update something on your dimension table, right? If you just do that one column, not that many rows, that's okay. Right? But imagine, with denormalization, you have a big flat table underneath, right, with the number of rows basically equal to your fact table, and you still want to update that one column. Guess what? You're stuck at data backfilling for, like, the next three days.
And this is actually a huge problem, and we have a user that operates on petabytes of data. They're building a metrics platform. On a metrics platform, you should be able to add metrics easily, to change metrics easily, right? Because they were on Apache Druid, they couldn't do joins. They had to do denormalization. So each schema change or each edit or change in their metrics took, like, two to three days, you know, just because of data backfilling. And that's something that you don't want to have just to get, you know, good enough performance. And at StarRocks, we've been working on this problem since, like, you know, it was open sourced. We basically added so many features to the system that we make joins actually run fast enough, so you can ditch your denormalization pipeline. You don't have to rely on your denormalization pipeline. Denormalization should never be a default, even for an OLAP system, even for customer facing analytics. It should be something that you build on demand, right, to fix those crazy workloads.
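To make the trade-off concrete, here is a minimal sketch of keeping the data normalized and doing the join on the fly. The table names and columns are hypothetical, it assumes a StarRocks frontend reachable over its MySQL-compatible protocol (so a stock MySQL driver like pymysql works), and it assumes the dimension table uses the primary key model so a single-column update stays cheap.

```python
import pymysql

# Hypothetical setup: fact_events (big, append-only) and dim_user (small, primary key model).
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
cur = conn.cursor()

# On-the-fly join instead of a pre-built flat table: no denormalization pipeline needed.
cur.execute("""
    SELECT u.segment, count(*) AS events, sum(e.amount) AS revenue
    FROM fact_events e
    JOIN dim_user u ON e.user_id = u.user_id
    WHERE e.event_date >= '2024-01-01'
    GROUP BY u.segment
""")
print(cur.fetchall())

# Changing a dimension attribute touches only the small table,
# not a petabyte-scale flat table that would need days of backfill.
cur.execute("UPDATE dim_user SET segment = 'enterprise' WHERE user_id = 42")
conn.commit()
```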
Yeah. So that's basically the comparison with the real time OLAP databases. And the next one I wanna talk about is basically the Trinos, the Prestos, right, the Dremios, those query engines that work on top of shared storage, on a data lake, right? We saw the data lakehouse kind of conversation start, you know, when Databricks introduced Delta, when Netflix introduced Apache Iceberg. Right now, we can do a lot more, you know, basically on top of Parquet files. Before, with Hive, everything needed to touch disk. You know, you want to do schema evolution, right? That needs to touch disk. Everything really relies on HDFS, right? And HDFS is really not that fast, right? So that created a lot of problems. And that's, you know, what gave birth to projects like Apache Iceberg, Apache Hudi, and Delta Lake, right? Basically, we're adding data warehouse like features, you know, to Parquet files, to open formats. So why are we doing that? It's because now we can run all sorts of workloads, you know, directly on top of one single source of truth data. Right? So, basically, you can introduce all sorts of tools and different kinds of workloads without having to move data around, without having to duplicate your data into 15 different copies and try to, like, synchronize all of them, and that's just impossible to do.
Right? But one of the challenges we see, you know, in the field is that although the storage is ready for a lot of those dashboard like workloads, a lot of the query engines that we have today are not really built for those kinds of workloads. Take Trino, for example. Right? Trino is really known as an engine that can connect to all kinds of data sources, right, and get okay performance. You can connect to MySQL and still do predicate pushdown. Right? You can still get that okay performance, but the engine itself is not 100% optimized for the best possible performance for one data source such as Apache Iceberg. Right? So it's very difficult to scale those systems, you know, to actually take over the kind of crazy workloads we run with, like, real time data warehouses, OLAP databases.
Right? And that's why StarRocks exists. Right? StarRocks really bridges that gap. You know, we really try to bring that data warehouse like performance directly to your lakehouse. Right? So you can run those customer facing workloads. You can run those dashboards with thousands of people refreshing at the same time. Right? That kind of concurrency, right, we can handle. So we basically complete the puzzle, you know, for the lakehouse kind of paradigm. Right? And this is the second one, I think, for the query engines. Now we have the other kind of cloud based data warehouses, right? You think of Snowflake, you think of Databricks, and you think of maybe even Athena, Redshift, and BigQuery.
And those are great projects and different in their own ways. Right? So what we do is we compensate for, you know, what they don't do. Basically, it's low latency, high concurrency. Right? Those systems are really not designed to scale to that level, you know, to scale to, like, sub second level latency. Right? So we basically complete that part of the tool.
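As a rough illustration of what "warehouse performance directly on the lakehouse" looks like in practice, here is a sketch of pointing StarRocks at an existing Iceberg catalog and querying it in place. The catalog name, REST endpoint, and database and table names are placeholders, and the exact catalog properties differ by catalog type (Hive metastore, Glue, REST) and StarRocks version, so treat the property keys as assumptions to check against the documentation.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***")
cur = conn.cursor()

# Register the existing Iceberg catalog once; no data is copied or ingested.
# Property names are illustrative; storage credentials (e.g. S3 keys) are omitted here.
cur.execute("""
    CREATE EXTERNAL CATALOG iceberg_lake
    PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "rest",
        "iceberg.catalog.uri" = "http://iceberg-rest.example.com:8181"
    )
""")

# Query the open-format tables directly, joins included.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM iceberg_lake.sales.orders o
    JOIN iceberg_lake.sales.customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
for row in cur.fetchall():
    print(row)
```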
[00:13:00] Tobias Macey:
And before we dig a lot more into different use cases, it's probably worth exploring the architectural design of StarRocks and the ways that that architecture enables the different capabilities that you're focused on. So I'm wondering if you can just give a bit of a high level overview about how the system is implemented and, some of the core, I guess, architectural ilities of scalability, durability, etcetera, that are those core abilities that are foundational to the design and the premise of StarRocks.
[00:13:32] Sida Shen:
Yeah. Of course. So StarRocks itself is incredibly simple. I mean, the thing itself has two kinds of processes. So there are the FE, the front end nodes. Basically, they're responsible for, like, taking the queries, understanding the query, generating query plans, and doing metadata management. Right? And also we have the BE nodes, or in our shared data mode they're called compute nodes. Right? Those are basically the workers, you know, the workers of the StarRocks cluster, right? They're responsible for actually executing the query plans the FE generates, and also storing data, or acting as the IO layer, you know, to read, to scan data from external shared storage, right? And architecture wise, I think one of the most important things for StarRocks is, first of all, it's written in C++, right? And we have vectorized all of the operators.
So, with SIMD, that really gives you the foundational good performance for OLAP, right? Because you think about it, like, from the bottom, right? Columnar storage, that's basically storing things in a columnar format, in columns. You think of vectorization, that's basically, you know, doing things in memory, doing it in big chunks, doing it in columns. And also, you know, with SIMD, right, that's processing multiple data points with a single CPU cycle, right? That's batch, right? That's in columns too, right? So basically, with all that, the more batched you run, probably the faster you're going to go, right? So this whole architecture is really optimized for those OLAP style workloads. So basically think of, you know, aggregations and joins, group bys and high cardinality group bys and things. So this is the basic architecture, and StarRocks also, you know, on the planner side, it has a cost based optimizer.
Basically, instead of relying on heuristic rules, right, StarRocks actually collects statistics from your own data, fresh statistics, and uses that to compute the cost of each candidate query plan, and uses that as the basis, you know, to come up with the best possible query plan, right? And that's really essential. That's probably the only way you can run high cardinality aggregations and multi table joins correctly today, right? That's the state of the art right now, basically a cost based optimizer. So that's basically an overview of the planner side. And also on the compute side, we feature a massively parallel processing kind of compute architecture, right? And the heart of MPP, I think, is to utilize, you know, all of the nodes, right, when you do a computation, when you execute a query. I think the heart of it is basically that we're able to shuffle data between memory while the query is running. Right? So the cool thing we can do with that is, think about a two phase high cardinality aggregation. Right? We can do multi phase aggregation because, you know, we can actually shuffle the data between nodes, between memory, while the query is still running. And we can do shuffle joins. That's basically the only way, you know, to do scalable big table to big table joins right now. So that shuffling service is essential, you know, in the architecture.
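This is not StarRocks code, just a toy Python sketch of the shuffle idea described here: each node pre-aggregates its local batch (phase one), partial results are routed between nodes by hashing the group key, and each owning node merges what it received into final totals (phase two).

```python
# Toy sketch of MPP-style two-phase aggregation with a shuffle, not StarRocks code.
from collections import defaultdict

NODES = 3
local_batches = [  # (group_key, value) rows as each node might scan them
    [("us", 1), ("de", 2), ("us", 3)],
    [("de", 4), ("fr", 5), ("us", 6)],
    [("fr", 7), ("fr", 8), ("de", 9)],
]

# Phase 1: local partial aggregation on each node.
partials = []
for rows in local_batches:
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    partials.append(acc)

# Shuffle: send each partial (key, subtotal) to the node that owns hash(key).
inboxes = [defaultdict(int) for _ in range(NODES)]
for acc in partials:
    for key, subtotal in acc.items():
        owner = hash(key) % NODES
        inboxes[owner][key] += subtotal

# Phase 2: each owner finalizes the keys it received.
final = {}
for inbox in inboxes:
    final.update(inbox)

print(final)  # e.g. {'us': 10, 'de': 15, 'fr': 20}
```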
Yeah. So that's basically it, a rundown, you know, of the architecture. You can dig into, you know, each one if you're interested. And you talked about, you know, the foundations of the system, right? The scalability or the whatever-ability. If I want to conclude the -ilities, is that how you pronounce it? -Ilities, right? It would be, I think, first scalability. StarRocks is really built to scale vertically and scale horizontally without losing much performance, right, to cater to thousands of QPS of concurrent OLAP queries while also maintaining sub second performance.
And that's the scalability part. And the second one I wanna say is predictability and reliability. We have a really big focus on those customer facing kinds of workloads. Right? In customer facing, having fast queries, yeah, it's important, but most important is that your query performance is predictable. Right? One very slow query is a lot worse than all of the queries being, like, medium level kind of slow. Right? So we built so many features and optimizations to ensure that your queries are not only fast, but they're also predictable and reliable, right? And even if your infrastructure fails underneath, some of your EC2 instances are gone, we can still maintain that kind of performance. We can dive into that a bit later. Right? And this is the second one, I think: predictability and reliability.
And the last one I wanna say is governability. I don't even know if that's a word, but governability, let's go with that. Right? Basically, okay, we talked about a lot of scalability, performance, being fast, being predictable, right? But now we want to not run them in a proprietary format. We want to enable them basically on Parquet files, on the Iceberg table format, on something that other engines are familiar with as well, right? To basically allow you to have a single source of truth of data, right, while also having those great things about StarRocks, to build those integrations on top of open formats, to work with our partners, right, to make that happen, to bring the governability, if that's a word, to the users.
[00:19:29] Tobias Macey:
One of the challenges that always comes up when you're dealing with distributed data systems, where you have databases that can scale the storage across multiple nodes, is the question of how do you deal with partitioning, sharding, replication? How does that factor into the ways that you think about structuring your queries, structuring your tables, etcetera? And I'm wondering, from a schema design perspective, how does that influence the ways that people think about what data to store, how to store the data, how to query the data when they're working with StarRocks, and how much of that does StarRocks take over so that you don't have to think about it as much. I'm thinking in particular about ClickHouse, where you have to think, at the time of table creation, about what type of replication am I doing for this table? Because if I want to change it down the road, then I effectively have to replicate all of the data to a new table, and likewise if I wanna change the sharding key or something like that. Yeah.
[00:20:25] Sida Shen:
Yeah. So, for that, I think that's more of a scalability kind of question. Right? So when you scale with ClickHouse, you have to, like, manually shard. Right? For StarRocks, I mean, this is really not a problem for shared data, because, like, all of the data is persisted in your S3 buckets, right? So when you scale, you know, you only scale the compute. If you don't think about the cache, you know, it's stateless. It doesn't persist any data. But for shared nothing, you know, the sharding and all of the data distribution are all automatic. So you don't really have to worry about redistributing your data manually, right, when you do want to scale your StarRocks shared nothing cluster. Right? And in terms of table design, I think this is a multilevel question too. So, yeah, we do have partitioning, and partitioning now is automatic.
Right? So that's a lot less that you would have to do, you know, compared to other kinds of systems. Right? And also, I think this is more general with OLAP. Right? In OLAP right now, we run everything, you know, in the columnar format, as I mentioned, on the storage side, on the memory side, right? And this is actually more performant than, you know, relying on indexes. So all OLAP systems today, actually including ClickHouse, don't really rely on indexes, you know, for the baseline performance, right? And indexes are basically used for your specialized kinds of workloads, right? You want to do, like, a massive count distinct with, like, trillions of rows, and that's when you want to build, you know, some kind of index, a bitmap index, right, to solve that kind of problem, secondary indexes, right? So, with not only the effort we've done but the effort that's been done before, even before us, right, and with the advent of vectorization, creating tables and using OLAP has become, like, so much easier, right, over the past, like, five, ten years. Right? And we're on a mission, you know, to make it even more accessible for all of the users out there.
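For a sense of what table design looks like in practice, here is a sketch of a fairly typical StarRocks table plus the kind of specialized index mentioned above. The schema is made up, and details such as automatic or expression-based partitioning and whether DISTRIBUTED BY is required at all vary by StarRocks version, so treat it as illustrative rather than canonical.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
cur = conn.cursor()

# A plain duplicate-key table: no manual sharding, partitions created as data arrives.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_events (
        event_time DATETIME NOT NULL,
        user_id    BIGINT,
        country    VARCHAR(64),
        latency_ms DOUBLE
    )
    DUPLICATE KEY (event_time, user_id)
    PARTITION BY date_trunc('day', event_time)
    DISTRIBUTED BY HASH(user_id)
""")

# Baseline scans and aggregations run off columnar storage without any index.
# A bitmap index is something you add on demand for specialized workloads,
# e.g. selective filters on a lower-cardinality column, not as a default.
cur.execute("CREATE INDEX idx_country ON page_events (country) USING BITMAP")
```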
[00:22:37] Tobias Macey:
Another aspect of the architecture is that, by virtue of the front end nodes being a separate deployable unit from the compute nodes, it introduces the potential for network latency. And even if you're deploying them collocated on the same hardware, it at least introduces process boundaries, whereas other database engines may be more fully vertically integrated. Obviously, that helps with the scalability story and the maintainability, deployability, all of those things. And I'm wondering what you see as the trade offs that are involved in that architectural decision of being able to separate it into tiered layers, both in terms of the negative trade offs, but also some of the benefits that it enables. And I'm thinking in particular about the shared data layer capabilities of being able to query against the lakehouse formats while also being able to have that shared nothing layer of the data stored on disk?
[00:23:36] Sida Shen:
Yep. So, let's go with the good things first. So first is the languages. Right? We use different languages for different purposes. Right? And FE and BE, you know, they have a really big difference in purposes. Right? The FE basically, you know, manages metadata and generates query plans. Right? It's really not that compute heavy. Actually, not compute heavy at all. Right? It's a lot of logic. And the BE is a lot more, like, parallel processing, really heavy on the compute. So for the two kinds of purposes, the FE is much more suitable to do in Java,
and the BE is much more suitable to do in C++. Right? By separating the two, we can use the language that's more suitable, you know, for that particular kind of use case. Right? And this is basically the biggest benefit that's out there. And talking about trade offs, yes, there are trade offs. Network is a concern, but there's really not that much data being exchanged, you know, between your FE and your BE or CN nodes. Right? That's just the query plan, the physical plan. Right? And maybe returning the data, but, you know, we do that through the BE too. So it's really not that much data to exchange. Right? I think there are two limitations to this kind of design. So the first is the two being written in different languages. Right? So let's say we have a little function that we implement in the BE, right, that we write in C++. Right? We cannot really call that function when only the FE is up. Right? So we have to, like, have the whole cluster up.
So, you know, this creates a little bit of trouble for us on the development side. Right? We have to think about, you know, we actually do have two kinds of languages, and where do we put what. Right? So this is basically, like, the first kind of limitation. And second, this is really not that good if your cluster is very small. So, basically, FE and BE are two kinds of processes. Right? And we cannot really handle the resource management between the two. Right? We have to, like, completely depend on Linux to do that.
It doesn't really do that that well. Right? So if you want to deploy StarRocks in production, what you will want to do is to have dedicated machines for your FE nodes. Right? But when you only need, like, two or three BE nodes, sometimes one BE node is enough for your deployment, and you want high availability for your FE nodes, right, you want three FE nodes and one BE node, then that's, you know, a waste of resources when your cluster is very small. Right? But, you know, StarRocks is really built for distributed compute, when you have, you know, a really big dataset and really big concurrency.
Right? So that is actually not that big of a problem either. Right? I hope that answers your question.
[00:26:49] Tobias Macey:
Yeah. And then now digging into the use cases that StarRocks enables, particularly around that tiering of the lakehouse shared data and the high speed OLAP shared nothing, You also have the intermediary bit of being able to cache the lake house data. I'm wondering if you can talk to some of the ways that that changes the data platform architecture and the overall usage patterns that are modified as a result of that capability versus the current landscape of I either have my lake house or I have my OLAP engine, but I have to do some extra work to be able to join them together, or maybe I can use an iceberg table for, you know, in, like, my ClickHouse, but it's gonna be it's still gonna be really slow and just some of the ways that having that unification of the stack changes the overall approach the teams take to the design of their data systems?
[00:27:49] Sida Shen:
Yeah. That's a fantastic question. So, yeah, just a little bit of background information, right, for the listeners that are not familiar with the StarRocks architecture. So we do have our own table format, and we do have our own file format. That's basically our internal tables. When we talk about shared nothing or shared data, we are talking about, you know, the proprietary, crazy kind of format that we built just for the best possible performance. Right? You have your low cardinality, you know, storage level optimizations. Right? And you have all sorts of crazy indexes you can use. Right? And the format itself is more optimized for low latency, high concurrency. Right? And that's basically the shared nothing and shared data kind of thing we're talking about. And also we have Apache Iceberg, right? And Apache Hudi and Delta Lake and Apache Paimon. Right? Those are open table formats.
And they govern ORC and Parquet files, and those are the open file formats. Right? And so how does the whole thing work? So basically, the underlying thing is we want to preserve all of the things that are good about open table formats, which is basically single source of truth data, right, in the open format, right? So with this as a requirement, how can we make the performance faster, right? So actually, you can create materialized views with StarRocks, right, from an Apache Iceberg table. And what that does is, a materialized view is basically a managed StarRocks table in StarRocks' own really fast format. Right? So what you can do is create a materialized view, right, and you can do some kind of data transformation too if it's needed. Right? And that becomes, you know, StarRocks' own kind of format. And also the whole process is managed, so you don't really have to think about refresh, you don't really have to think about, you know, is the data in sync. You know, if you configure it correctly, it's gonna work, and you don't really have to worry about that. Right? And, also, the materialized view features automatic query rewrite. So, basically, if you do something like a pre-aggregation or a denormalization with a StarRocks materialized view, you don't really have to change your SQL query to benefit from the materialized view. StarRocks' cost based optimizer can direct your query, you know, to the most appropriate materialized view, right, automatically, so you don't have to worry about that. Right? So that's basically the way we figured out, you know, how to enable this kind of proprietary level, like, low latency, high concurrency, still on the single source of truth data in open formats.
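Here is a minimal sketch of that pattern under the same assumptions as before (an `iceberg_lake` external catalog, made-up table names, MySQL-protocol access); the exact materialized view clauses, refresh options, and defaults depend on the StarRocks version, so this is illustrative rather than canonical.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
cur = conn.cursor()

# An asynchronously refreshed materialized view over an Iceberg table:
# the rollup lives in StarRocks' own format, the source of truth stays in Iceberg.
cur.execute("""
    CREATE MATERIALIZED VIEW daily_revenue
    REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
    AS
    SELECT date_trunc('day', o.order_ts) AS order_day,
           c.region,
           sum(o.amount) AS revenue
    FROM iceberg_lake.sales.orders o
    JOIN iceberg_lake.sales.customers c ON o.customer_id = c.id
    GROUP BY 1, 2
""")

# Dashboards keep querying the Iceberg tables directly; when the optimizer
# recognizes a matching query shape, it can rewrite it to hit the materialized view.
cur.execute("""
    SELECT date_trunc('day', o.order_ts), sum(o.amount)
    FROM iceberg_lake.sales.orders o
    GROUP BY 1
""")
print(cur.fetchall())
```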
[00:30:42] Tobias Macey:
And by virtue of being able to use one interface to query both your shared nothing data sources, where you're guaranteed to have high throughput, high speed, and your lakehouse table formats. Obviously, being able to join across them is going to introduce slowdown, but it simplifies the logical approach to how to actually work across that data, as well as giving you the ability to, for instance, materialize or build new tables off of the lakehouse data into the shared nothing architecture layer so that you can consistently have high speed through that. So I'm thinking in terms of, like, the medallion architecture style thing pioneered by Databricks, or the typical dbt flow of staging, intermediate, mart, where maybe you stage all of your data as Iceberg files, but then in the intermediate or mart layer, you're materializing those into the StarRocks shared nothing layer. I'm just wondering what you're seeing as some of the ways that people are approaching that style of workflow and some of the benefits that they're seeing as a result.
[00:31:49] Sida Shen:
Yeah. So yeah. That's a great question. And, actually, materialized views are used by, like, almost all of our users right now. Right? So querying directly, we can get great performance. You can think about, like, around three to five times, you know, compared to Trino, right, cache off for cache off. But materialized views just take this to, you know, a whole other level. Right? This is basically a painless way for you to add denormalization pipelines or pre-aggregation pipelines on demand. Right? The reason I say on demand is because of the query rewrite capability. Right? So you really don't have to worry about whether a query is still gonna work after I build this pipeline. Let's think about it: you build something with Spark, right? Build a denormalized table with Spark.
You have to change your SQL script or reconfigure your entire dashboard, you know, for that thing to take effect, right? And that was really painful for our users, you know, before they moved to StarRocks, because you basically have to figure out all of the pre-processing pipelines before you actually build and run your application, because otherwise you have to, like, go bother your users, you know, to change their SQL scripts. And that's nothing that anyone wants to do. Right? Yeah. So with this, you can actually, you know, just build your application, right, and figure out the performance afterward, and add denormalization pipelines afterward. And you end up adding so many fewer pre-computation pipelines. Right? And also, you end up going to production a lot faster, you know, with the query rewrite capability of materialized views.
This is, you know, basically the main way our users are blending our proprietary, optimized format with an open format like Apache Iceberg. Right? And there are actually other use cases too. You know, it doesn't only help with queries. Our proprietary format also has a primary key table, right, which you're probably familiar with. It has a record level index, right, that's able to accelerate your data freshness to, like, sub ten seconds, you know, directly on top of columnar storage. And that, you know, has to do with memory. So that's not something that Apache Iceberg can do. So we actually have a user that uses the primary key table as the caching layer when they ingest the data in. So it first lands in the primary key table, so you get that, you know, ten second data freshness.
And they built a whole mechanism to basically sync the data from their primary key table to Apache Iceberg, like, every hour. Right? Delete and do a truncate of the table on top, and make that whole thing, like, an atomic operation. So that's a crazy workload. Right? They actually presented this whole thing last year at the Iceberg Summit, as Tencent Games. So they were basically able to get sub ten second data freshness on top of, like, petabytes of data. Right? And the single source of truth is still on Apache Iceberg, and they're running that with materialized views, running that at, like, sub second query performance. It's just some crazy numbers if you actually combine the two correctly. Right? This is probably, like, one of the most exciting ways you can use StarRocks, you know, that I've seen.
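This is not Tencent Games' actual mechanism, just a sketch of the general shape of that pattern: fresh data lands in a StarRocks primary key table for sub-ten-second freshness, and a periodic job moves the accumulated rows into an Iceberg table that stays the single source of truth. Table names are hypothetical, it assumes a StarRocks version that can write to Iceberg tables through an external catalog, and the real implementation would need to make the copy-then-clear step atomic and idempotent, which this sketch does not.

```python
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="loader", password="***", database="rt")
cur = conn.cursor()

# Hourly job: move everything older than the current hour from the hot
# primary key landing table into the Iceberg archive, then clear it out.
cur.execute("""
    INSERT INTO iceberg_lake.archive.events
    SELECT * FROM rt.events_landing
    WHERE event_time < date_trunc('hour', now())
""")

# WARNING: in a real pipeline this delete must be coordinated with the insert
# above (and with late-arriving rows) so the hand-off is effectively atomic.
cur.execute("""
    DELETE FROM rt.events_landing
    WHERE event_time < date_trunc('hour', now())
""")
conn.commit()
```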
[00:35:11] Tobias Macey:
And that was gonna be my next question, which is the reverse: landing your data in StarRocks for high write throughput and being able to do read-after-write querying in those other applications. So either using it as an application storage layer, maybe not necessarily as the primary database, but for being able to push analytical events, similar to the use cases that ClickHouse is often deployed for, and then being able to periodically flush that data down to Iceberg so that you maintain that as a raw storage table, but you're also still able to use it for powering interactive use cases as the data gets rewritten?
[00:35:51] Sida Shen:
That's a really exciting thing. I mean, that's not a problem that I think I think open table formats are looking to solve. Right? And because that's a lot of that is engine specific, you know, a lot of that depends on memory. Right? So, you know, that's one really innovative way you can solve that problem, have that data freshness.
[00:36:08] Tobias Macey:
In terms of the lakehouse support, you mentioned Iceberg, Delta, Hudi. There are a large and growing number of table formats that are available. Obviously, they all have different use cases that they prioritize. And I'm wondering how the work that you're doing in StarRocks focuses on either subsets of that functionality or just the overall problem matrix of supporting all of those tables and being able to support all of the features, versus what are the core elements of those formats that you care most about? And particularly as people ask you to add more and more tables to the list.
[00:36:46] Sida Shen:
Yeah. That's a great question. That's a great question. So, for all of the table formats, I think almost all of them, even just Iceberg, Delta, Paimon, and Hudi, and even Lance, right? They were all born for similar reasons. They're all solving this very similar problem, which is that the Hive table just doesn't work anymore with the scale we're operating on, with the frequency that we want to update our data, right, the frequency we want to add new columns, to delete columns, you know, to do schema evolution. Right? It's not keeping up. And I think they all solve very, very similar problems, but they started in different places. And I do see a lot of the table formats converging, you know, and having shared functionality a lot more than we saw, like, two, three years ago. And yesterday at the Iceberg Summit, a lot of the features that were introduced, like deletion vectors, like, let's say, geospatial kinds of features, they're shared between Delta and Iceberg.
So I imagine actually the opposite. So going forward, we should be able to see a lot fewer proprietary features, you know, in those table formats. And what also helps is that we have innovative projects right now, such as Delta UniForm in Delta 2.0 and Apache XTable, that basically, you know, take one copy of Parquet files and generate three copies of the metadata. Metadata is a lot, a lot lighter, you know, so you can have three copies of metadata and one copy of the data files. So in that way, you know, we should be able to work with more of a unified interface for our query engines. That's tomorrow. That's something we really want to come, like, actually tomorrow.
[00:38:38] Tobias Macey:
And then for cases where teams already have some investment in one of these other query engines, whether they're using Trino, Dremio, etcetera, for their lakehouse use case or they're using Snowflake, etcetera, for their warehousing or they're using ClickHouse or Druid for OLAP or maybe they're using some combination of all of those. What are the situations where you typically see teams addressing StarRocks as a supplementary tool versus a replacement of any or all of those?
[00:39:10] Sida Shen:
As a supplementary tool, I think we see a lot more of that, you know, with the query engines. Right? Because the heart of, you know, the lakehouse is basically one giant copy of data and 15,000 different tools on top of it using that one copy of data without moving the data. And that really sparks, you know, innovation on our side because that really drives specialized tools. And StarRocks is that type of specialized tool on the lakehouse. Right? And we're probably the only one out there that is really hyper focused on high concurrency, low latency, reliability, you know, those kinds of customer facing workloads. So we do see a whole ton of StarRocks running side by side with Apache Spark.
So Spark does a lot of the business transformation, does a lot of, like, the compaction. Right? And we handle a lot of the interactive queries, and we handle a lot of, like, the high concurrency, like, the dashboards and the reports, and also, you know, customer facing applications where they're directly serving all of those queries to their customers and running everything on the fly. So that is probably the biggest one. And, also, we do see a lot of ClickHouse users, but that's more often a direct replacement. They, you know, they just look at their, like, 10,000 denormalized tables and get frustrated, and look at their bill and, you know, they cry. Right? And they wanna find something that can run joins on the fly, and, you know, they find us. And later, they discover, you know, even more reasons to stay on StarRocks, because of the scalability and because of the real time data upserts. You can actually, you know, have your data change. What a surprise. Right?
Like, in real time. So they end up, like, slowly, slowly replacing the whole system. But denormalization is basically most of the reason that a lot of the ClickHouse users choose StarRocks in the first place.
[00:41:11] Tobias Macey:
Digging more into the ClickHouse comparison, one of the things that gives them a built in advantage at the moment is the overall ecosystem investment around it, where they've been around for a while, they have gained a good reputation. There are a lot of systems that presume ClickHouse as their storage layer. I'm thinking even in particular of some very new systems, such as AI gateways, that rely on ClickHouse as the storage layer for storing some of the user interaction data, telemetry data, etcetera. And I'm wondering, in terms of that piece of the puzzle for StarRocks, how you're approaching the ease of integration, ease of adoption from a protocol and ecosystem layer.
[00:41:50] Sida Shen:
Protocol and ecosystem, we're definitely spending a lot more of our effort on those. But I think, in our experience, that's really not a huge blocker for a lot of the users, because the things they're experiencing when they're trying to go to production, when they're trying to scale, those problems are much more difficult to solve than a lot of the higher level kinds of issues. And StarRocks is not really, exactly, yeah, we do lack some functions when compared to ClickHouse. But we are working with the community, you know, to make StarRocks a much more mature product.
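One concrete piece of that integration story, for listeners who want to kick the tires: StarRocks exposes a MySQL-compatible wire protocol on the FE, so most MySQL drivers, ORMs, and BI tools can connect without a dedicated connector. A minimal sketch follows; the host, port, and credentials are placeholders, and 9030 is the usual FE query port, so confirm it for your deployment.

```python
import pymysql

# Any MySQL-protocol client works; the same idea applies to JDBC drivers or
# SQLAlchemy URLs such as "mysql+pymysql://analyst:***@starrocks-fe.example.com:9030/demo".
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="demo")
with conn.cursor() as cur:
    cur.execute("SHOW DATABASES")
    print(cur.fetchall())
```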
[00:42:32] Tobias Macey:
And digging a bit further into my brief mention just now of the AI ecosystem, obviously, that's another area that has been driving a lot of development in terms of completely new data engines as well as feature additions to existing ones. And I know that StarRocks has an array data type which can be used for storing vectors. And I'm just curious what the current stance is of StarRocks in that broad ecosystem of AI applications and being able to enable them and deal with vector embeddings, etcetera.
[00:43:04] Sida Shen:
Yes. So I think it's good for me to give, like, a general picture and then, like, some feature explanations. Right? So first, forgive me for going back to lakehouse again, but this is really lakehouse. So, lakehouse, they did that a long time ago, right, since 2017, 2018. It didn't really become, like, that popular, that big until just a few years ago. Right? The reason for that is, I think, AI is really a big push, you know, for the lakehouse adoption, or the lakehouse kind of conversation. Because, like, with data warehouses, with, you know, just data warehouse kinds of workloads, five, ten years ago, it was probably possible that one system could take care of, you know, all of your needs. Right? But with AI just right around the corner, nobody knows what's gonna come tomorrow. Right? What kind of new tools am I gonna have to use the next day?
So users taking ownership of their own data, storing their data in Parquet files instead of, let's say, the StarRocks file format, right, has become a lot more important in the past five years just because of, you know, the boom of AI and machine learning kinds of workloads. And that's, I think, a lot of the reason why we started to build this whole lakehouse integration. Right? Because we do want our users to own their own data and to have the freedom, you know, to choose what kind of tools they can use tomorrow. So that's more, like, the lakehouse kind of conversation. And what makes the lakehouse even more important today, I think, is the birth of agents, analytical agents. Agents work a lot faster. They don't really have to sit there and think. You know, they can go through, like, 10,000 tokens in, like, a few seconds. Right? Having a universal view of your entire dataset inside of your organization is something that's very important for AI agents. So you want to have that very strong single source of truth kind of data layer inside of your organization to better leverage, you know, AI agents. Right? All sorts of agents, not only, let's say, analytical types of agents. And that, I think, you know, gives it an even bigger push. That's the whole, like, lakehouse thing, right? I think, just talking about agents, a lot of users in our community are actually already building analytical agents with StarRocks. I can't really say who yet because they're still figuring it out, in the early phase. Right? But we do see a lot of very exciting new projects to come. I think StarRocks really pairs well with agents because, with agents, you have to have agent scale kind of concurrency. Nobody knows, like, how big that's gonna be. And low latency, which is gonna become, like, an even bigger kind of requirement, because they work just so much faster than I do. And also, you have automation in your pipeline. You have less human in the loop. You need your query engine to be able to self optimize, right, to not have that many knobs you have to turn, right, to be able to learn from the queries that were previously executed, to constantly evolve and to make sure that your query plans are always, you know, the most optimal. And we do have that feature built, by the way. And I think the characteristics of StarRocks are really good for solving a lot of the agent problems that we see in the industry today. Right? And so that's basically, you know, the bigger thing, right, the more forward looking thing. And feature wise, you know, yes, Tencent Games did contribute the vector index functionality.
Right? So you can use StarRocks to basically do RAG, right, or to do similarity search like Tencent Games. Tencent, I think they do, like, a reverse image search. And you can do, like, a hybrid search, right, because we are so good at SQL queries. Right? You can do a lot more, you know, with your vector search with StarRocks. And that feature is available if you want to try that out today. And also, to better incorporate into the whole AI ecosystem, we really strengthened our Python integration. Right now we have Arrow support, now we have Python UDFs, and we're gonna have a lot more, you know, going into the future. Does that answer your question?
[00:47:26] Tobias Macey:
Absolutely. Yeah. Definitely a lot of exciting stuff to dig into. And in your experience of working on this project, working in the community and with your customers at CelerData, what are some of the most interesting or innovative or unexpected ways that you've seen StarRocks applied?
[00:47:43] Sida Shen:
Okay. So, let's see, you know, the first one has to be Tencent Games, right, with their crazy, you know, sub ten second data freshness directly on top of Apache Iceberg, with petabytes of data and sub second latency. Right? That is a very innovative way, you know, of utilizing our format together with open formats, as a lakehouse innovator. And something that is, I think, a lot more interesting is that the StarRocks powered lakehouse, StarRocks and Apache Iceberg, are actually solving problems nobody believed were possible. Like customer facing analytics, right? That's something that requires very low latency, high concurrency, and stability of your latency, reliability of your latency. Right? We have users that use StarRocks and Apache Iceberg in production, serving their customers, their paying customers. Right? So the first example can be, let's think about Herdwatch.
Herdwatch is actually a customer. They build customer facing dashboards for cows, for, like, livestock. They have, like, tens of millions of livestock they have to monitor for their customers, which are, you know, farms. And they use StarRocks and Apache Iceberg to basically replace Athena, right, and brought query latency down from, like, tens of seconds to, like, sub second. TRM Labs is another one I've mentioned too. Right? And we have a lot more customer facing use cases, such as Pinterest. Pinterest uses us to, you know, do that kind of, you think of, like, a YouTube analytics page. Right? I'm sure you're familiar with that. So that's for, like, content creators and for advertisers, so you'll see in real time how their ads are running, right, to make real time decisions on those. Right? With that, and with Coinbase, also customer facing, just a bunch more. We have a bunch more case studies on the starrocks.io website. You can check it out. And there's one thing I wanna mention, something unexpected.
I did just talk to a user last week. We have that primary key table that can do real time data upserts, but it's not a TP database. That guy is not doing OLAP. That guy's using StarRocks as a transactional database, a transactional database only, to serve his operational workload. It's been working fine for, like, a year and a half. I just heard about that, like, last week. Yeah. That's probably the most, like, unexpected way, you know, people use it. You probably shouldn't, but, you know, it's good to see. It's always fun to see.
[00:50:24] Tobias Macey:
And in your experience of working on this technology, working with this community, and in this problem domain, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:50:35] Sida Shen:
I think challenge-wise, we've been through so many difficult things. I mean, building our cost-based optimizer was a lot more difficult than we thought. It's basically an optimization problem, and we did need to work with the community a lot in the first, like, two years or so before we could confidently say that we can turn it on by default. And thank God we're open source, or else a lot of the features that we have today would not be possible without our initial adopters trying it out and showing us the crazy, stupid query plans that our CBO had generated.
Right? Without those, you know, that wouldn't be possible. And I think the biggest challenge that we see is basically a scale challenge. We do test our system really rigorously. Right? We think terabytes of data, that's big; ten terabytes of data, that's big. But once we get put into production systems, ten terabytes is not the data size, it's the scanned data size after pruning, right? And that's very difficult, right, with hundreds of StarRocks instances, each with, like, fifty-plus cores. And a lot of the problems do show up. You find new bottlenecks in your system when you push it to that kind of scale, right? For example, when there's data skew, shuffling can be a huge challenge. And that's a problem we spent, like, almost half a year to solve. Right? We didn't even think it was there, you know, in the first place. And that's also thanks to us being open source and to our open source users. You know, we've been iterating for the past, like, five years to make this system as stable and as performant and as reliable as it is today.
[00:52:23] Tobias Macey:
For people who are interested in simplifying their data stack or who really care about speed and latency, what are the cases where StarRocks is the wrong choice?
[00:52:34] Sida Shen:
Yeah. We have to talk about that guy again. Yeah. Don't use it as a transactional database. There are better options out there. If you want to, you know, analyze your transactional database, read the CDC and use something like Debezium, right, to CDC it into StarRocks, and we support updates. So, you know, that works. Don't use it as a transactional database. That's probably not your best bet. And also, for very large ETL kinds of workloads that run for, like, hours or days, you're better off probably using something like Spark, something that's designed for those kinds of workloads, especially, you know, if you're on a data lake. You can just run the two tools, you know, simultaneously on top of one copy of data.
So yeah. So that's basically it.
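As a rough illustration of the CDC-plus-updates pattern Sida mentions, a StarRocks primary key table lets changes captured by a tool like Debezium land as upserts. The table and column names below are made up for the example, so treat it as a sketch rather than a recipe.

    -- Sketch: a primary key table that accepts upserts from a CDC stream (names are illustrative).
    CREATE TABLE orders_mirror (
        order_id   BIGINT NOT NULL,
        status     VARCHAR(32),
        amount     DECIMAL(12, 2),
        updated_at DATETIME
    )
    PRIMARY KEY (order_id)
    DISTRIBUTED BY HASH(order_id);

    -- A row replayed from the CDC feed with an existing key overwrites the previous version,
    -- so the table tracks the latest state of each order without a separate merge job.
    INSERT INTO orders_mirror VALUES (1001, 'shipped', 49.99, '2024-06-01 10:15:00');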
[00:53:18] Tobias Macey:
And as you continue to build and iterate on the StarRocks engine and the ecosystem around it, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas or capabilities that you're excited to explore?
[00:53:33] Sida Shen:
Yeah. So this year, we're really looking to further stabilize, you know, the query performance, not only for our own, like, shared data or shared nothing kind of architecture, right, but more for, like, Apache Iceberg. Right? So how to deal with unmaintained or less maintained Iceberg metadata files, right, how to better read them in parallel, how to better cache them, how to make that metadata cache more reliable, right? And also how to make our own data cache more reliable. We've already built a lot of features just to solve that specific problem, and we're going to work with the community to make those features more mature and more robust, right? So you can end up running the craziest kind of workloads on your open data, on your own data. And this is exactly what we want in the future, not only for your data kind of workloads, but also for your AI. So we take part, you know, in that big evolution of making everything open. And that is really what we're looking for: further powering customer-facing analytics on those open table formats.
[00:54:43] Tobias Macey:
Are there any other aspects of StarRocks, this overall space of high throughput, high speed queries across data lake and shared nothing data layers, and the capabilities that it enables that we didn't discuss yet that you'd like to cover before we close out the show?
[00:55:05] Sida Shen:
Yeah. There are a lot more. I mean, I think one aspect is caching, if you're interested, we can talk about it. Right? Caching is something that we built actually a pretty long time ago for our shared data architecture. Right? Actually, our first iteration of our caching already looked really good on benchmarks. That's why I don't really like benchmarks that much, because they don't show the full picture. Right? If you run everything hot on the benchmark, which everybody does, it doesn't show, like, actually how well your cache can hold hot data. Right? But in customer-facing analytics, that's, like, the most important aspect of a cache: how to prevent your entire cache from getting flushed by a big scan, and how to make sure that if some kind of service triggers a query, the cache is there, so you're not reading off S3 at, like, a hundred milliseconds per scan. That's not acceptable. Right? So we built a whole bunch of things, you know, just for the cache itself, like a segmented cache. You know, we have a mechanism to protect hot data from being evicted by a big scan or a big query. And we have compute replicas. So basically, we have multiple replicas of the same cache stored on different nodes, so that if one of your EC2 instances fails, you still have your performance SLA. Because, like, for customer facing, we actually see a lot of performance SLAs that are more strict than the availability of EC2 instances. And, you know, it's just a lot of things. And I think for the cache, you know, there's another thing, like, for example, cache warm-up. We have a warm-up mechanism for the cache. That's really not that conventional.
You don't really see that in those, like, interactive query engines, because, like, stability of latency is really not that big of a deal in those systems. But for customer facing, we cannot afford, you know, to have any kind of slow queries. And that's just caching. And we built so much more stuff, you know, to help you stabilize your queries. You can, you know, go to our release notes, and we have a bunch of features that just got released, you know, in the three dot two release, oh, the three dot four release, that you can check out. That's basically all I have. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the StarRocks team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
There's so many gaps. I think the biggest gap right now is that the requirements from customers have changed so fast. You know, from "just a cloud data warehouse is good enough for me" to "I want to run everything on Parquet files." You know, this is a big shift. But a lot of the tools that we're using today were developed, like, five, ten years ago, when this requirement was not a thing. Just think about, like, in our field, query engines. We weren't, like, thinking about building the fastest query engine for a Hive table. Right? Back then it was, you know, like, connect to all of your data sources and get okay performance, or ingest into a proprietary format to get the best kind of performance. Now we want both, and the tools are not here yet. And we try to close that gap, but there are more gaps in this field, you know, just in this transition to open formats. So I think this is definitely the biggest gap we see right now.
[00:58:28] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing on StarRocks and the capabilities and use cases that it supports. Definitely a very interesting project, very enticing technology that I intend to spend more time investigating. So I appreciate the time and energy that you and the rest of the team put into that, and I hope you enjoy the rest of your day. Thank you. Thank you. Appreciate that.
[00:59:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to the Podcast and Guests
Sida Shen's Journey into Data Engineering
Overview of StarRocks and Its Origins
StarRocks in the Competitive Landscape
Architectural Design of StarRocks
Data Partitioning and Sharding in StarRocks
Use Cases and Benefits of StarRocks
StarRocks as a Supplementary or Replacement Tool
Challenges and Lessons Learned in Developing StarRocks
Future Plans and Developments for StarRocks