Summary
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database
Interview
- Introduction
- How did you get involved in the area of data engineering?
- What is SiriDB and how did the project get started?
- What was the inspiration for the name?
- What was the landscape of time series databases at the time that you first began work on Siri?
- How does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.?
- What do you view as the competition for Siri?
- How is the server architected and how has the design evolved over the time that you have been working on it?
- Can you describe how the clustering mechanism functions?
- Is it possible to create pools with more than two servers?
- What are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem?
- In the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon?
- One of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. How are metrics identified in Siri and is there any support for tagging?
- What have been the most challenging aspects of building Siri?
- In what situations or environments would you advise against using Siri?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Jeroen van der Heijden about SiriDB, a next generation time series database.
[00:01:23] Unknown:
So, Jeroen, could you please introduce yourself?
[00:01:26] Unknown:
Yes, Tobias. Thank you for inviting me to this interview. I started my career as a system engineer and then shifted more towards development. About five years ago, I made the switch to become a full-time developer at the company I currently work with. At this company, we are building a solution for monitoring IT infrastructure and IT components, and therefore we are collecting a lot of data. So basically, that's what we are doing right now.
[00:02:00] Unknown:
And how did you first get interested or involved in the area of data management?
[00:02:05] Unknown:
It was at this company that we needed to build this system for monitoring all these IT infrastructure components. So we needed to find a way to store all this data, and that's how we automatically got involved with data engineering. This is also the time when we started needing a time series database.
[00:02:30] Unknown:
And so can you take a minute to describe what Siri DB is and, how the project first got started?
[00:02:38] Unknown:
Yes. It's a database for storing time series, and a time series is like a metric, for example, where you store points in time. Each point in a time series database has a time value and an actual value. For example, you can measure the CPU of your notebook, and at each second you want to store the value of the current CPU usage. This is stored in a time series, or a metric, as we call it. SiriDB is a database which is specialized in storing that type of data. We started this because the monitoring solution that we are building at our company had a need for such a database.
Because if you wanted to store everything in a SQL-like database or something like that, it would probably not scale and not fit.
[00:03:31] Unknown:
And what is it about time series in particular that requires a different format of database and way of storing the data that doesn't work with more traditional models?
[00:03:42] Unknown:
Well, it scales in two dimensions. You can have a lot of series, which you could translate to tables in, for example, SQL. But with that approach, you would need a table for every time series that you want to store, so that would require a lot of tables. For example, with our monitoring solution, we currently store more than two million metrics, so to say. So it scales in two dimensions: it scales in time, and it scales in the number of metrics. And I think that's different from how SQL, or a traditional database, usually works.
[00:04:23] Unknown:
And just briefly, where does the name come from?
[00:04:27] Unknown:
Well, my daughter's name is Iris; in Dutch, you say "Iris". Turning it around gives "Siri", and it's a database storing time series, so basically series of points. That's how we got the name, and we just stuck with it.
[00:04:47] Unknown:
clever. Thank you. And what was the landscape of time series databases like at the time that you first began working on Siri, such that you felt it was necessary and you couldn't just use one of the off-the-shelf options?
[00:05:01] Unknown:
Well, at the time, Influx was just available as a beta version, and I believe InfluxDB was using LevelDB as its underlying storage system. We actually used InfluxDB for some time, but this was a very early version, and it had some issues. And, of course, there were other options like OpenTSDB. But in general, I think it was a time when time series databases really started to grow. I started SiriDB just as a proof of concept, just to see: can we do this ourselves? Probably we could have used something else, but we only tried Influx, and at the time it was too new. So that's why we started to create our own time series database.
[00:05:53] Unknown:
And now that you've got it to a point where it's production ready and being used, how do you feel that it compares to the other options, such as InfluxDB that you mentioned, or Timescale, which is a somewhat newer one, or Kairos, or, as you mentioned, OpenTSDB?
[00:06:08] Unknown:
Well, I think we are the most similar to InfluxDB; we are often compared with InfluxDB. But I like the idea of Timescale, because they are fully compatible with SQL and everything, so that's also a nice time series database. I think a key difference is that SiriDB started off with its own storage system; we never relied on something else. I know that InfluxDB has now also switched to its own storage system, but back then they used LevelDB, and they switched to something else in between, BoltDB, if I remember correctly.
But, yeah, I think one key difference is that we had, from the beginning, our own storage system underlying SiriDB. Probably there are now a lot of time series databases which do the same, but that's at least one thing which is different from the ones you mentioned.
[00:07:04] Unknown:
And do you feel that Siri is more suited to particular use cases than some of the others, or are there areas or feature sets where Siri differs drastically from some of the other available options?
[00:07:19] Unknown:
Well, not drastically, I think, but it's like a combination of things. We have the combination that we are scalable: we can scale across nodes, and not all the other time series databases have that scalability themselves yet. I know that InfluxDB has scaling options, but I believe the open source version does not, and Timescale is still working on that. So in that sense we are different from those. And then, I think, it's when you have a lot of metrics, and each metric has, for example, a million data points. We scale pretty well in having both a lot of metrics and a lot of data points. So when you only need integer and floating point values, because those are the only values SiriDB at this point is able to store, then I think SiriDB is a good option.
[00:08:24] Unknown:
And on the point of the data types that it supports, as you mentioned, they're floating point or integer values. But is there any capability for storing things like events, where you might want to indicate, for instance, that a deploy happened, or any particular single point in time as a reference? Or is that something you would store in some external system and then merge in at the time that you're trying to overlay it as you're displaying the data in some sort of dashboard?
[00:08:55] Unknown:
Yeah. Currently, we are storing this in another database, so we are not using SiriDB for that purpose. But, actually, I am also developing this in SiriDB right now. So it might be that in a few weeks, or a few months at most, we will be able to store this type of value in SiriDB itself. But at the moment, we just focus on integers and floating point values. So this is something that will change in the future, but at the moment, we do not support it.
[00:09:23] Unknown:
And can you take some time to describe how the server is architected, how the internals of it work, and how that design has evolved over the time that you've been working on it? Well, I first created SiriDB as a single node, and,
[00:09:38] Unknown:
yeah, only later I added the cluster mechanism that it has. If you take the single node, it accepts data coming in and first stores all the data in a sort of write-ahead log, or you can call it a buffer if you want. Then, when it has collected a certain amount of points, these points are moved to a shard. You can see a shard as a window in time: for example, one shard can store a couple of days or maybe a couple of hours. You have to choose that when you create a database, and the shards get optimized over time. So if you store, for example, a few chunks of data inside a shard, then SiriDB will run an optimize task over the shard so that these points are sorted. They are mostly already sorted, but they may have overlaps. These overlaps can happen because SiriDB allows you to write points from the past.
For example, if you miss some points, you can later add them to SiriDB. But this way, on disk, you can get an overlap in time, and we want to sort this overlap out. So we run an optimize task over these shards so that they are written optimally. Then, when I had the single node fully working, I extended SiriDB so that it works like a cluster, which makes SiriDB scalable and fault tolerant. Those are things I created later.

And how does that clustering mechanism operate now that you've got it functional?

Well, it assigns time series, or metrics, to a pool, based on the time series name. So when you add a new pool, for example, SiriDB sort of re-indexes the time series so that each existing pool moves a part of its series to the new pool, and when this process is finished, all pools have approximately an equal amount of series.
So your data is always spread across multiple pools, in a mostly equal way. It is important to note that when series move from one pool to another, they only move to the new pool. So if you add a new pool to your cluster, only series will move from the existing pools to this new pool, and there is no re-indexing between the existing pools. It is also possible to add a second server to each pool, so that your database has some fault tolerance. The second server acts as an active replica: when they are both online, they can both process queries, and there is not really a master/slave relationship in that sense.
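To make the single-node write path concrete (points accumulate in a buffer and are then flushed into time-window shards, which a later task sorts to resolve overlaps from out-of-order writes), here is a minimal, illustrative Python sketch. The class, names, and thresholds are invented for illustration; this is not SiriDB's actual on-disk implementation.

```python
from collections import defaultdict

SHARD_DURATION = 6 * 3600  # seconds covered by one shard window (configurable per database)
FLUSH_THRESHOLD = 4        # flush a series' buffer once it holds this many points

def shard_start(ts, duration=SHARD_DURATION):
    """Map a timestamp to the start of its shard window."""
    return ts - (ts % duration)

class TinySeriesStore:
    def __init__(self):
        self.buffer = defaultdict(list)   # series name -> buffered (ts, value) pairs
        self.shards = defaultdict(list)   # (series, shard_start) -> stored points

    def insert(self, series, ts, value):
        self.buffer[series].append((ts, value))
        if len(self.buffer[series]) >= FLUSH_THRESHOLD:
            self.flush(series)

    def flush(self, series):
        # Move buffered points into the shard that owns their time window.
        for ts, value in self.buffer.pop(series, []):
            self.shards[(series, shard_start(ts))].append((ts, value))

    def optimize(self):
        # Late ("past") writes can leave points out of order; sort each shard.
        for points in self.shards.values():
            points.sort()
```

Writing four points that span two six-hour windows lands them in two separate shards, which is the behavior the answer above describes.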
[00:12:17] Unknown:
And is it possible to add more than two servers to a pool, in case you wanted to, for instance, spread the pool across three availability zones in AWS? No, it's only possible
[00:12:31] Unknown:
to have two servers in a pool. We do the same thing with Oversight: we spread it across two locations. But at the moment, at least, it's not possible to use three locations. This is mainly because synchronizing is a lot easier when you limit yourself to two servers in a pool. So maybe we'll change this in the future, but at the moment we stick with this limit just because it's easier for synchronization.
[00:12:57] Unknown:
And do you have any issues with conflicts in mirroring that data between the two servers in a given pool? Or is the data, as it's written, just balanced between the two using, for instance, a load balancer, with each server filling in the gaps in the other's data storage?
[00:13:16] Unknown:
Because we limit everything to two servers for replication, it's a lot easier to prevent this from happening. If you would allow, for example, a third server, this problem becomes much bigger. It's quite easy to prevent if you limit yourself to two servers, because one server receives the data and can store it on its disk in a sort of buffer. Then, if the other server is online, it can immediately send the data to that second server, and it always knows which server still needs to receive data points or whether that server already has them, because there is no way two servers are updating a third one, for example.
[00:14:09] Unknown:
And when the data is being ingested, is there some routing mechanism? For instance, if you're trying to write a metric to a server when that metric belongs to a different pool, does it automatically route that for you? Can you just describe how that works?
[00:14:27] Unknown:
Yeah, you can just choose one server in the cluster, it doesn't really matter which one, and you just start writing to that one. That server knows about all the series. Well, it doesn't really know about all the series, but if a series exists, it knows in which pool it exists. For example, if you come up with a metric, the receiving server knows that this metric must live in, for example, pool one or pool two. So it forwards this metric to the correct pool. Each pool knows, by an algorithm, to which pool a metric belongs. That's how the metrics are spread across the pools. And it is a particular sort of algorithm, because what we don't want is that metrics which the algorithm assigns to pool zero can move to pool one when you scale, for example, from two to three pools. That would mean that when you have, for example, 20 or 30 pools, you get a lot of traffic between the pools when you just add a new one. We don't want that. We only want the minimal amount of series to move to the new pool.
So the algorithm is like a hashing, but it's a different algorithm for assigning the series to a pool, just to prevent series from being transferred back to a pool they came from, so to say.
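The property Jeroen describes, where adding a pool moves series only to the new pool and never shuffles them between existing pools, is exactly what rendezvous (highest-random-weight) hashing provides. The sketch below illustrates that idea in Python; it is not SiriDB's actual assignment algorithm, just a well-known scheme with the same minimal-movement property.

```python
import hashlib

def pool_for(series_name, n_pools):
    """Assign a series to the pool with the highest hash score.

    When a pool is added, a series only moves if the new pool wins the
    score comparison, so series never shuffle between pre-existing pools.
    """
    def score(pool):
        h = hashlib.sha256(f"{pool}:{series_name}".encode()).hexdigest()
        return int(h, 16)
    return max(range(n_pools), key=score)

series = [f"server-{i:03d}.cpu" for i in range(1000)]
before = {s: pool_for(s, 3) for s in series}   # cluster with 3 pools
after = {s: pool_for(s, 4) for s in series}    # a 4th pool is added

# Every series that moved, moved to the new pool (index 3); none were
# re-shuffled among pools 0, 1, and 2.
assert all(after[s] == 3 for s in series if after[s] != before[s])
```

With roughly a quarter of the series migrating to the new pool, the distribution stays approximately equal, matching the rebalancing behavior described above.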
[00:16:09] Unknown:
And is it possible to scale back down if, for instance, your data volumes decrease and you don't need quite as much throughput? And what would that process look like as far as decommissioning a given pool to scale back in?
[00:16:24] Unknown:
Well, at the moment it's not possible at all, but it would be very easy to make it possible to scale back by removing the last pool, if we wanted to add that in the future. It would be rather difficult to remove, for example, the first or the second pool if you have more than two. So maybe we will add this in the future, but at the moment it's not possible at all. You can, however, remove a replica or rebuild a replica. So if something really happens to a server, you can remove the replica and rebuild it. That's possible. But truly scaling down in the number of pools is not possible at the moment.
[00:17:09] Unknown:
And given the fact that it does have clustering capabilities, I'm curious what the failure modes are for a given deployment, and where SiriDB falls on the spectrum for the CAP theorem?
[00:17:22] Unknown:
At least one server in a pool must be reachable. If a pool is completely down, like both servers in the pool are down, then the whole system doesn't work anymore, so queries and inserts will not be accepted. As long as each pool has one server online, the system keeps working. If you look at the replication process, I think we try to be as consistent as possible with the data. Before we send a package to the replica server, we first save it in a buffer on the disk. So if something happens to the server, if it falls out or the network connection drops or whatever, it's still stored on disk, and when everything is back online it will be replicated to the other server. So in terms of replication, we try to be as consistent as possible. In terms of scaling across pools, each pool must be online, so that can be a problem. It's a good idea to spread your servers over two locations and make sure that each pool has one server at each site, so to say.
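The buffer-before-replicate behavior described here (queue a point durably first, then ship it to the replica once that replica is reachable) can be sketched as follows. The class and attribute names are invented for illustration and do not reflect SiriDB's actual replication protocol.

```python
class Replica:
    """A stand-in for the second server in a pool."""
    def __init__(self):
        self.online = True
        self.data = []

    def receive(self, point):
        self.data.append(point)

class ReplicatingNode:
    """Points are queued before replication, so a replica that was
    offline can catch up once it comes back."""

    def __init__(self):
        self.local = []     # points applied locally
        self.pending = []   # points not yet delivered to the replica

    def insert(self, point, replica):
        self.pending.append(point)  # durably buffer first
        self.local.append(point)
        self.drain(replica)

    def drain(self, replica):
        # Ship queued points only while the replica is reachable.
        while self.pending and replica.online:
            replica.receive(self.pending.pop(0))
```

If the replica is offline during inserts, the pending queue grows; a later `drain` call delivers everything in order, which is why a node always knows exactly which points its peer still needs.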
[00:18:32] Unknown:
And as far as the databases themselves, you mentioned that when you create the database, you need to set a boundary on the shards it uses for storage. So I'm wondering what's the reasoning for that, and what happens to those individual metric points as they age beyond the time horizon for a given shard?
[00:18:55] Unknown:
Well, that's a small misconception, because it's the duration of a shard that you're actually referring to. It just creates more shards. Siri will never throw away data; it will always keep the data, just in different shards. So the duration you are referring to is the duration of a single shard. We make this configurable because sometimes you want, for example, metrics with approximately one data point every second, and a good shard duration would then be something like six hours or maybe a day. Sometimes you have the total other end of the spectrum, with only one point per metric each day, and you want to store maybe 10 or 20 years of data, or even more. A different shard duration would then perform better. So it's actually for performance that I made this configurable. One downside is that, at the moment, you need to choose this for the whole database.
So it's not possible to choose a different shard duration for individual metrics, and I might change that in a future release. I want this to be more dynamic, so that you don't have to choose it anymore and SiriDB decides by itself what shard duration is best for a series or metric, whatever you want to call it. But at the moment, you have to choose it yourself when creating a database.
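The performance intuition behind choosing a shard duration can be shown with simple arithmetic: the goal is for each shard to hold a healthy number of points, neither a handful nor an unbounded pile. The function name here is invented for illustration.

```python
def points_per_shard(sample_interval_s, shard_duration_s):
    """Roughly how many points of one series land in a single shard."""
    return shard_duration_s // sample_interval_s

# One point per second with a 6-hour shard: a sensible chunk per shard.
assert points_per_shard(1, 6 * 3600) == 21_600

# One point per day with a 6-hour shard: most shards would hold at most
# one point, so a far longer duration (say, a year) packs points together.
assert points_per_shard(86_400, 6 * 3600) == 0
assert points_per_shard(86_400, 365 * 86_400) == 365
```

This is why a one-size-fits-all duration per database is a limitation: series sampled every second and series sampled daily want very different shard windows.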
[00:20:21] Unknown:
And one of the common issues that happens when you're using a time series database in an operations context is the need for high cardinality of the metrics, where you have multiple different ways to identify a given data point: whether it's stating that it's the number of I/O operations for a disk, but also saying that it was from a given server with a particular purpose and, you know, maybe the environment it is being deployed in. So I'm wondering how the metrics in Siri are identified, and whether it has support for adding that high cardinality for the metrics, or any sort of tagging capabilities?
[00:20:59] Unknown:
Well, at the moment, we are using the metric name as the identifier, so you must be careful choosing a good name. For example, we include all these things inside the name of the metric. Besides that, we also support dynamic grouping: you can create a group for a collection of series based on a regular expression. That way you can, for example, create a group of all your CPU metrics, or all your memory metrics, or just a customer or a location. With these groups, you can perform set operations, so you can combine them with everything that is allowed with sets, like intersection, difference, things like that. We are also working on another tag system, which allows you to tag individual series, and then you could use them just like groups.
We are still working on that right now. The current groups are dynamically updated: when you add a new series which matches your group, it will automatically be added to this group. So you have to choose a good metric name, but if you add a new metric where the name corresponds to a certain group, it will automatically be added, or dropped if you remove the series. This is more or less how Influx worked, and we came from Influx, so that's a little bit the reason why it works like this.
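The regex-based dynamic groups and set operations Jeroen describes can be illustrated with plain Python sets and the `re` module. The series names and patterns here are made up; SiriDB's group syntax differs, but the semantics are the same idea.

```python
import re

series = {
    "srv01.cpu.core0", "srv01.cpu.core1", "srv01.mem.used",
    "srv02.cpu.core0", "srv02.mem.used",
}

def group(pattern, all_series):
    """A 'dynamic group': every series whose name matches the pattern."""
    rx = re.compile(pattern)
    return {s for s in all_series if rx.search(s)}

cpu = group(r"\.cpu\.", series)
srv01 = group(r"^srv01\.", series)

# Set operations combine groups: intersection, difference, and so on.
assert cpu & srv01 == {"srv01.cpu.core0", "srv01.cpu.core1"}
assert srv01 - cpu == {"srv01.mem.used"}

# Groups are dynamic: a new matching series joins on re-evaluation.
series.add("srv03.cpu.core0")
assert "srv03.cpu.core0" in group(r"\.cpu\.", series)
```

Encoding dimensions in the name and intersecting regex groups is how a name-keyed store approximates the tag-based filtering of databases with native tag support.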
[00:22:22] Unknown:
And you mentioned briefly that you created your own query language for SiriDB. So I'm wondering what were the design considerations when you were planning out what that syntax would look like, and how well has that stood up as the database has started to gain more use?
[00:22:41] Unknown:
Well, we started very simple, just with select statements and things like that. And then our own monitoring product required certain expressions and aggregation methods. For example, what we really like is to combine aggregation methods together: sometimes we want to take the difference of a metric and then take the difference again and again, just to flatten the series out. So we want our query language to be able to perform that kind of thing. Another thing we also needed for Oversight, our monitoring product, is the capability of merging metrics together. For example, take the CPU you are monitoring: you take all the cores, group them together, and present them as a single metric. The query language should allow you to tell SiriDB that you want this. Maybe I'm explaining this a little bit roughly, but... No, that makes sense. And so Siri has been built with a strong focus on being used in the context
[00:23:55] Unknown:
of systems operations. Have you seen it being applied in other contexts or for other use cases?
[00:24:06] Unknown:
Well, yeah, early this year we made SiriDB open source, so now we are hearing about other projects. One of the things I heard about is that they use it for a weather system: they store all this weather information, and I believe they are sometimes even using Raspberry Pis to store this data on. SiriDB also runs on Raspberry Pis. So they are using it for that type of data, which is totally different from our use case. But in a sense, it's just time series data, so it's possible to use it for that purpose.
Another one I heard about was financial data. We actually, as a demo one time, scraped all the data from the Yahoo Finance site and put it into SiriDB. So these are other use cases, but you're right that we created it mainly for monitoring IT infrastructure systems.
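The two query-language capabilities Jeroen described earlier, chained aggregation (taking the difference repeatedly to flatten a series) and merging several series into one (such as summing CPU cores), can be sketched in plain Python. These helpers are illustrative only and are not SiriDB query syntax.

```python
def difference(values):
    """Successive differences of a series' values."""
    return [b - a for a, b in zip(values, values[1:])]

def merge(*series, op=sum):
    """Merge time-aligned series into one, e.g. summing all CPU cores."""
    return [op(vals) for vals in zip(*series)]

cumulative = [0, 10, 25, 45, 70]
assert difference(cumulative) == [10, 15, 20, 25]        # per-interval change
assert difference(difference(cumulative)) == [5, 5, 5]   # applied again

core0 = [10, 20, 30]
core1 = [5, 5, 5]
assert merge(core0, core1) == [15, 25, 35]               # cores as one metric
```

Composing such aggregations inside the query language saves clients from pulling raw points and post-processing them application-side.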
[00:25:03] Unknown:
And one of the things that's worth noting is that the database itself is written in C, which I imagine is why it was able to be deployed to Raspberry Pis, because it doesn't have the additional overhead of running on something like the JVM.
[00:25:17] Unknown:
Yeah, that's correct. It's all written in C. However, we have built some tools to connect to SiriDB. For example, we have a client available, and this client is written in Go. So I sometimes hear the misconception that people think it's written in Go, but it's actually only the prompt client which is written in Go; SiriDB itself is written in C. And I noticed that you also have client libraries for things like Python
[00:25:43] Unknown:
and Ruby for people who are using those languages as well, and also an HTTP plugin for being able to address it from things like Grafana, for building dashboards on top of it?
[00:25:56] Unknown:
Yeah. We mainly created this HTTP library, or add-on so to say, to let you use any programming language you want, because you should usually be able to connect through that API. But we certainly want to support native clients for as many programming languages as possible. We have them available for Python, for Go, for C and C++, and for Node.js. We would also like to add at least Java, which is at the top of the list. But of course we would like everyone to extend this to other programming languages as well, like Ruby and maybe PHP, things like that.
[00:26:43] Unknown:
And it's also worth talking about the process of open sourcing it because as you mentioned, when you first began work on it, it was closed source. And then at the beginning of this year, you released it. So I'm wondering what was your reasoning for releasing it to open source, and what was the reason for that particular timing?
[00:27:02] Unknown:
Now, the particular timing is that we felt the product was more ready, because before that time we were still in a sort of development process. We were using it internally, but it was not really ready to release yet. And we decided to make it open source mainly because we think it's nice for our customers to know that their data is stored in an open source database. For example, if you are a customer of our monitoring solution, it's nice to know that if something happens to our company, your data is still in an open source database, so you can run it yourself or do whatever you want. That was the main reason we decided to make it open source. And it's also something we wanted to try: our monitoring solution is closed source, and we liked the idea of making this open source and seeing where it goes.
[00:27:55] Unknown:
And what have been some of the most challenging aspects of building and maintaining the project?
[00:28:00] Unknown:
I think building the scalability was the most difficult, because we want to be able to scale on the fly, with no downtime involved while scaling. This gives the system, or the database so to say, a whole new state during this process, where it needs to know both the old and the new state it is in. So I guess that part was the most difficult feature to build. Right now it's mostly extending with new features, and I think that's simple, so to say, compared to the scalability which we have already created.
[00:28:41] Unknown:
And for somebody who wants to start using Siri, what does the deployment process look like, and what are some of the resources or environmental considerations that they should be thinking of?
[00:28:54] Unknown:
Yeah. Like we said, it's written in C, so it should compile on most systems. It doesn't ask for a lot of system resources: it has pretty low memory and pretty low CPU usage. But you should keep in mind that SiriDB does not really fit well when you have, for example, billions and billions of data points on a single metric. It's better used when you have, say, a couple of million series or metrics, each having a million records. That's better than only a few series with a billion records on a single metric, if you understand what I'm saying.
[00:29:42] Unknown:
And so it sounds like that's largely because, if you have your data spread across more different series, then you're able to scale that horizontally. Whereas if you have a smaller number of series that you're tracking metrics for, and a large volume of them, then because of the way the data gets balanced, it would all be constrained to a single host. So you would be constrained to having to scale vertically as opposed to being able to scale out. Is that accurate? Yeah, that's right. That's the main problem. I think we can solve this a little bit by maybe
[00:30:20] Unknown:
adding continuous queries to SiriDB. That's something we don't have right now. Maybe this can help a little bit in solving this issue, because then we would be able to at least aggregate your time series into fewer points. Because I guess that if you store billions and billions of records on a single metric, you don't want to query them all. So there might be a solution in, how do you say it in English, like, compressing this data? Yep. Into fewer points. I believe it's called continuous queries in other time series databases.
And I think it can help with this problem. But at the moment, SiriDB is better at having a lot of metrics: each with a million records, or a few more, is not a problem, but it shouldn't be a billion records on a single metric.
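The continuous-query idea mentioned here, aggregating a dense series into fewer points, is essentially time-window downsampling. A minimal illustrative sketch (the function name and default aggregation are invented, not a SiriDB feature):

```python
def downsample(points, window, agg=lambda vs: sum(vs) / len(vs)):
    """Aggregate (ts, value) points into one value per time window.

    Defaults to the mean per window; any aggregation function works.
    """
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % window, []).append(value)
    return [(start, agg(vals)) for start, vals in sorted(buckets.items())]

raw = [(0, 1.0), (30, 3.0), (60, 10.0), (90, 20.0)]
# One averaged point per 60-second window: four raw points become two.
assert downsample(raw, 60) == [(0, 2.0), (60, 15.0)]
```

Run continuously as data arrives, this kind of rollup keeps a single hot metric's point count bounded, which is exactly the relief described for the billions-of-points-per-metric case.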
[00:31:17] Unknown:
And given that the project is open source now, are there any particular areas of contribution that you are looking for help with?
[00:31:25] Unknown:
Yes. What I mentioned before is that we want to add more programming languages; we would like to have native clients, for example, for Java. It would be nice to get some support from the community to help build these connectors, so to say. Another thing is that I think SiriDB is also interesting for home automation projects, just because it runs on the Raspberry Pi and uses low resources. I think it's a good time series database in that area, and there are a lot of home automation projects. It would be nice to see connectors from these projects to SiriDB,
so it's easier to use SiriDB in that area.
[00:32:11] Unknown:
Are there any other topics that we didn't talk about yet that you think we should cover?
[00:32:16] Unknown:
No, I don't think so. I think you covered most of the things. Okay. Well for anybody who wants to follow the work that you're up to and keep up to date with Siri and the other things that you're working on, I'll have you add your preferred contact information to the show notes. And just for a final parting question, from your perspective, what is the biggest need that you see in the available tooling or technology for people who are working in the data management industry?
[00:32:45] Unknown:
I would like to see maybe another database which has a focus on different things, more like a subscription model, where you can subscribe to, for example, a metric and get updates back whenever a new value is received, instead of a database which you need to query all the time. And I don't think there are a lot of databases which do this in a very good way. Maybe I don't know them all, but I don't know any database which is doing this really well. Alright. Well, thank you very much for taking the time out of your day to join me and talk about the work you're doing with Siri. It's definitely an interesting project
[00:33:32] Unknown:
and one that I am likely to start experimenting with on my own. So thank you for that, and I hope you enjoy the rest of your day. Yeah, you too. Thank you, Tobias, for this interview, and I hope you enjoy your day too.
Introduction and Sponsor Messages
Interview with Jeroen van der Heijden
Introduction to Siri DB
What is Siri DB?
Landscape of Time Series Databases
Comparison with Other Databases
Server Architecture and Clustering
Failure Modes and CAP Theorem
High Cardinality and Tagging Capabilities
Query Language Design
Use Cases and Open Source Transition
Challenges in Building Siri DB
Deployment and Resource Considerations
Community Contributions and Home Automation
Final Thoughts and Contact Information