Summary
At the foundational layer, many databases and data processing engines rely on key/value storage for managing the layout of information on disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs, the RocksDB engine can become a performance bottleneck. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times higher throughput, how the lower latencies reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
- Your host is Tobias Macey and today I’m interviewing Adi Gelvan about his work on SpeeDB, the "next generation data engine"
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SpeeDB is and the story behind it?
- What is your target market and customer?
- What are some of the shortcomings of RocksDB that these organizations are running into and how do they manifest?
- What are the characteristics of RocksDB that have led so many database engines to embed it or build on top of it?
- Which of the systems that rely on RocksDB do you most commonly see running into its limitations?
- How does the work you have done at SpeeDB compare to the efforts of the Terark project?
- Can you describe how you approached the work of identifying areas for improvement in RocksDB?
- What are some of the optimizations that you introduced?
- What are some tradeoffs that you deemed acceptable in the process of optimizing for speed and scale?
- What is the integration process for adopting SpeeDB?
- In the event that an organization has a system with data resident in RocksDB, what is the migration process?
- What are the most interesting, innovative, or unexpected ways that you have seen SpeeDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SpeeDB?
- When is SpeeDB the wrong choice?
- What do you have planned for the future of SpeeDB?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- SpeeDB
- RocksDB
- TerarkDB
- EMC
- Infinidat
- LSM == Log-Structured Merge Tree
- B+ Tree
- LevelDB
- LMDB
- Bloom Filter
- Badger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Adi Gelvan about his work on SpeeDB, the next generation data engine. So, Adi, can you start by introducing yourself?
[00:02:05] Unknown:
Yeah. Hi. My name is Adi Gelvan. I'm one of the three cofounders of SpeeDB. Born and raised in Israel, I did my studies in math and computer science, had a career in the IT realm, then moved to work for a couple of startups. I have a deep background in storage and data, and my cofounders have a deep background in data structures.
[00:02:28] Unknown:
And do you remember how you first got involved in the area of working with data?
[00:02:32] Unknown:
Well, I started my sales career at EMC, then moved to a company called XIV, which was in grid storage, then had my time at IBM managing the storage division, selling storage hardware and software. I moved to Infinidat, where I actually met my cofounders. So storage and data have been my world for the past 15 years or so.
[00:02:53] Unknown:
In terms of the work that you're doing now at SpeedDB, I'm wondering if you can describe what the focus is, what it is that you're building there, and some of the story behind how you settled on this as a particular problem that you wanted to spend your time and energy on?
[00:03:07] Unknown:
I think the story is rather simple. My cofounders were the chief architect and chief scientist at a storage unicorn called Infinidat. At some point, they were given a mission to find a software layer that would manage the metadata within the storage. They thought about maybe writing the whole data stack from scratch, which was a no go. And then they thought, okay, let's try to find a third party product that could do that. And they went to the market and found out that the storage engine market is pretty known, pretty stable, controlled by companies like Facebook and Google and Apple and Oracle.
And they looked at the products and decided to go for the most prevalent LSM storage engine, which was RocksDB from Facebook, used by thousands of customers worldwide, and they decided to embed it in the storage. So they got into the mission. They actually did it, and they saw that it was working perfectly fine on small datasets. But when it came to larger datasets, things did not really work as advertised. So they reached out to the community and to Facebook, and they realized that RocksDB, like the other storage engines, was actually designed to support metadata, and metadata is usually considered small data. So if you wanna manage small sizes of data, this is great. If you wanna manage large amounts of data, like hundreds of gigabytes and terabytes,
you need to work around this problem and shard the data, which wasn't really acceptable from their side. That was the problem. They decided to do something about it. And in terms of the
[00:04:47] Unknown:
use cases where you are gonna end up with these large scales and large volumes of data in a RocksDB storage layer, I'm curious what are the situations that will lead to that and some of the ways that that problem manifests. You mentioned having to shard the data, but I'm curious as to what types of systems are going to be embedding RocksDB and pushing that volume of data into it, or the types of scale that are going to lead to those limitations of RocksDB?
[00:05:16] Unknown:
I think the good news or the bad news, you know, depending on where you're looking at the problem from, is that the structure or the skew of the data created in this world is actually changing. So a decade ago, metadata was actually a fraction of the data, but now with unstructured data and the connected devices world, the data looks different. It's many, many small objects, each having metadata of its own. And when you look at it, then sometimes the metadata is bigger than the data itself. So large systems today, even if they don't have enormous amounts of data, they may have enormous amounts of objects and large amounts of metadata.
Now when you look at the ratio between the data and the metadata, sometimes the metadata could even be bigger. So when you're looking at systems, it's not so much about use cases, but it's a matter of where you're getting your data from. We know that the fastest growing data segment today is unstructured data. So when you look at the combination of unstructured data, connected devices, IoT, then you realize that the way data is being managed should also change. You mentioned that
[00:06:38] Unknown:
one of the reasons that we are seeing such a strain on the RocksDB layer is because of this skew in the relative proportions of metadata to, I guess, you could call it actual data, but primary data. And I'm wondering if there are any particular industries or applications where you're seeing this happen more predominantly and what you're seeing as your sort of target market and customer for the work that you're doing at SpeeDB?
[00:07:07] Unknown:
I think that we see this phenomenon happening across all segments. When you look at traditional databases, the challenges they're faced with are the same, because now customers are simply maintaining and keeping more data in smaller sizes of data. And you see it in cyber applications, when you need to monitor all the traffic coming from the web and you're talking about those billions of JSONs and small objects. You see it in retail. You almost see it everywhere, because I think that the mantra today is that you need to keep all the data, because all data has value and you actually wanna store it all, which nowadays seems pretty obvious. But when you wanna store it all, there are plenty of ways to do that, because storage systems are super, super advanced, but then you wanna make sense out of it. You need to store it in databases, and then you need to fetch the data.
I think one place where we're seeing it is streaming data. So you've got all these objects coming into the applications in streaming mode. Before you had the streaming databases, you needed to store the data and then access it. Now you really wanna fetch the data as it's streaming, so you need to keep its state. So streaming databases are becoming more and more prevalent. More and more customers are trying to get insights from data coming from streaming. So we see those use cases as pretty prevalent out there. One of the interesting things about RocksDB
[00:08:45] Unknown:
and the work you're doing at SpeeDB is that it's usually not going to be the user facing layer that people are interacting with. It's an embedded key value store that is either used for metadata tracking or tracking state across different streaming partitions in something like Kafka or ksqlDB, or it's used as the foundational key value store that a more full featured database product is built on top of. And I'm curious, as you're working with customers who are running into some of these scaling limitations of RocksDB, what you're seeing as some of the ways that those limitations manifest at that higher level in these sort of higher layers of the stack, and what the sort of debugging process looks like when these customers run into these situations and ultimately realize that it is actually RocksDB that's the problem?
[00:09:41] Unknown:
Before this company, I ran the business of a database company, a database management system company. And before that, I had my career in storage. I can tell you that when you look at the data layers, you have the application, then the database, and then the storage. Now each of these layers has been tremendously developed in the past decade. The only part that didn't scale along with the database and the storage is this intermediate layer of the storage engine. So, consequently, you see databases are scaling almost indefinitely. Storage systems are enabling you to store large capacities, whether it's SAN, NAS, object, you name it. But the software stack between them, the storage engine, which was hidden, is now becoming the bottleneck because it hasn't been developed at the same pace. So when you look at database management systems, most of the DBAs did not even know that there was a storage engine running behind them. Now that the data structure has changed, the number of objects in the databases has changed, and you scale it like the database can, then you realize that storage is fine, the database is fine, but something's not working correctly. And then you say, okay, where's the bottleneck? And you find out in more and more cases that the storage engine is actually in charge of how data is being stored between the database and the storage.
Things like LSM trees, B+ trees, no one knew these, you know, terms before. Now people actually get to know them, and they're like, okay, this is the bottleneck. How do we solve it? Now there are workarounds. You can shard the data. You can put in multiple nodes, but that's like putting on a plaster. It's not solving the problem because, one, it's super complex from a development point of view; second, and not less important, it's very, very expensive, especially if you're on the cloud. Spinning up thousands of nodes is very easy, but it's very costly. So the work we're doing here at SpeeDB is enabling you to scale up rather than scale out and do more with the current hardware you have, enabling you to run more workload, more data, and more threads on the existing hardware.
[00:12:00] Unknown:
As you are working with customers who have come to the conclusion that RocksDB is the bottleneck in their system, I'm curious if there are any particular trends in terms of the actual higher level tools or systems that people are running where they are experiencing these issues, whether it's something like ksqlDB where they're trying to scale that out and scale usage and they're hitting that limitation, or some of these other database engines, if there are just any particular sort of trends or product categories that you're seeing people experience these problems with.
[00:12:31] Unknown:
So first of all, I think the obvious thing is that the more data per node you have, the bigger the chances that you will suffer from things like write amplification and stalls in your system. The more threads, the more processes you have on that node, will bring you to the same place. The slower the media is, the bigger the impact of the write amplification, which we solve, is gonna be. And the more mixed workloads, reads and writes that you're doing in parallel, will challenge your storage engine. And I think that what I just mentioned, we see with customers even if they don't have this problem today, they will have it tomorrow because the amount of data we need to keep is expanding.
The amount of resources we can afford is limited, and the number of applications doing more processes with the database is increasing because you wanna have more features, support more users. So it's gonna come anyway. In a month, 2, in a
[00:13:36] Unknown:
year, everyone suffers from it. And you mentioned that in your work at Infinidat, you ran into some of these bottlenecks, which led you to create SpeeDB. And I'm wondering, what are some of the attributes of the RocksDB storage engine that you think have led to its widespread usage as an embedded key value store and so many other systems being built on top of it? And as a corollary to that, I'm wondering if you see any opportunities in applying some of the lessons in your work at SpeeDB to some of the other embedded key value stores like LevelDB or LMDB?
[00:14:11] Unknown:
Let me give a bit of history about RocksDB. So RocksDB is actually a fork of LevelDB from Google. The team at Facebook tried to use LevelDB, which was open source, and they found out that it wasn't really great with fast media. It was designed to run on spinning drives, and it wasn't really good with multi threading. So with the changes they made, they invested in the multi threading and in the customization of the LSM tree for faster drives. They've built a very robust and stable system. So if you take RocksDB vanilla today, it's very stable, very powerful, very tunable.
You have more than 250 parameters that are intertwined, so you can actually configure it for every workload, which is a big advantage, but it's also a disadvantage because you have to have a PhD in LSM trees and RocksDB in order to really tune it well. So why are people using it? Obviously, Facebook did it, did a great job, and it's a very powerful storage engine. It's very complex, so writing something like this on your own is not an easy task, and it's open source. What we realized, and this was actually the decision that made us try it at Infinidat, is that working with RocksDB, the vanilla version out of the box, is really perfect if you're running use cases like Facebook's, which means small datasets on various nodes, but it doesn't really work well if you're limited in CPU and memory and a limited number of nodes, which we think is more applicable to the rest of the customers out there. We also realized that there is a large amount of tuning you need to do with the RocksDB parameters, because most of these parameters are intertwined, meaning that they affect each other, so tuning it is almost an impossible mission if you're not an expert in RocksDB.
So the idea was to generate, or rewrite, the whole LSM stack so it would fit the general purpose customer, customers who need to scale data, do multithreading, work with terabytes of data without sharding, and use less memory and CPU. And, also, how do you not just reduce, but eliminate, the dependency on tuning the system all the time? So we've been working on auto tuning of the parameters in RocksDB, or in SpeeDB, according to the workload. We've learned a lot during the development. We've learned even more from working with customers, because when you meet customers, you realize that all the perfect theories that worked great in PowerPoint, and almost great in the lab, hardly work with customers, because customers are doing the things that they wanna do, not what was designed for SpeeDB or RocksDB. So we realized that every customer has its own special needs, and with every customer we've worked with, we've improved our product. And I think that today, after having various customers from various fields, we feel that we've learned a lot. We still keep improving.
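For readers who have never touched the storage engine layer directly, here is a minimal sketch of the kind of intertwined tuning being described, written against the stock RocksDB C++ API. The specific values and the database path are illustrative assumptions, not recommendations from the interview and not SpeeDB defaults; the point is that each knob interacts with the others, which is exactly the burden the auto-tuning work is meant to remove.

```c++
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <rocksdb/filter_policy.h>

#include <cassert>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Memtable sizing: bigger write buffers delay flushes but consume memory
  // and interact with the level-0 compaction triggers below.
  options.write_buffer_size = 256 << 20;  // 256 MiB (example value)
  options.max_write_buffer_number = 4;

  // Compaction and parallelism knobs that trade write amplification against
  // compaction backlog and write stalls.
  options.max_background_jobs = 8;
  options.level0_file_num_compaction_trigger = 4;
  options.level0_slowdown_writes_trigger = 20;
  options.level0_stop_writes_trigger = 36;

  // Block-based table with a Bloom filter to cut point-lookup I/O.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/example-db", &db);
  assert(status.ok());
  delete db;
  return 0;
}
```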
[00:17:29] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. While I was researching for this interview, one of the other things I came across was a project called TerarkDB, which is another fork of RocksDB that I believe maybe predates your work a little bit, but I'm assuming went in a slightly different direction. And another project that I believe was a reimplementation of RocksDB is the Badger storage engine from what the folks at Dgraph were doing.
I'm just wondering if you have any thoughts on some of the sort of compare and contrast of what you're building at SpeeDB, how it compares to Terark, and maybe some of the potential lessons learned in the work that the Dgraph folks did with Badger to make the key value storage engine Go native, and maybe some of the opportunities for improving the language bindings for things like SpeeDB?
[00:19:12] Unknown:
You're probably referring to Terark, which is owned by ByteDance, the owners of TikTok. Yes. That's the one. Yeah. So Terark was an independent company acquired by ByteDance. They needed their own storage engine, like many, many big companies. We know Terark. It's actually a pretty good storage engine. They actually took RocksDB. They forked it, but they forked it to their own needs. Now the needs of ByteDance are mainly to deal with super large objects, which are predominantly movies. They're using a combination of large objects that they put in BlobDB and the metadata that they're managing in RocksDB itself.
We've actually tested it. It's pretty good. It's pretty effective with very big objects. So they put effort into the compression and, you know, things that will serve them well in the tail response time that you have with super large objects. We did more work on the deep tech inside. We said, okay, what are the main issues that we wanna solve? Not for ByteDance, not for Facebook, but for the majority of the users. And we realized that the worst or the biggest challenge is the write amplification. To solve the write amplification, you can't really play with the parameters. You need to do something much more, I will say, on a lower level. We needed to rebuild the LSM tree and to think out of the box, how do we eliminate or reduce the write amplification factor from 30 to 5?
So we went down that path, and we found a way to reduce it by redesigning the LSM tree in a way that added more dimensions to the way LSM trees work. We call it multidimensional LSM. That way we came to the point where we could do much more with much less. Then we realized we needed to add some more things, because we improved the write amplification, but we added more things that weren't there. So we needed to design new bloom filters, a new index, and to rework the way the processes are being written and the background mechanism. So we found ourselves trying to improve one thing, then redesigning the whole thing so that it would actually improve every parameter that we wanted.
I think that for ByteDance's use, Terark will probably work better than what we did because it's really meant to solve their problems. But we do believe that if customers have high workloads, large capacities, billions and trillions of small objects, SpeeDB will probably give the best results.
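As a rough reference for the factor-of-30 figure mentioned above, a common back-of-envelope model for leveled LSM compaction (a textbook approximation, not SpeeDB's internal accounting) puts write amplification at roughly the level size ratio times the number of levels:

$$\mathrm{WA}_{\text{leveled}} \approx T \times L, \qquad L \approx \left\lceil \log_T\!\left(\frac{\text{total data size}}{\text{memtable size}}\right) \right\rceil$$

With RocksDB's default size multiplier of T = 10 and three to four levels for hundreds of gigabytes of data, that lands in the neighborhood of 30 to 40 physical writes per logical write, which is why bringing the factor down to single digits requires restructuring the tree rather than tuning parameters.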
[00:21:59] Unknown:
So you mentioned that for the work that you're doing at SpeedDB, you introduced some of these multidimensional LSM trees. And I'm curious if you can dig into some of the other optimizations that you added to RocksDB and your overall approach to identifying some of the bottlenecks and performance issues in the system that were options and opportunities for being able to introduce different algorithmic approaches or different storage algorithms?
[00:22:29] Unknown:
I'll give a couple of examples. So the difference between an LSM tree and a B-tree is that a B-tree is very read oriented. You can read objects very fast because the search complexity is very low. With an LSM tree, you can write very fast because you're writing to the cache and then it's writing immutable files to the layers. Customers till today, if they need a write oriented system, they will go for an LSM, like Facebook. And if you need a read oriented one, like MySQL, you'll go for the B-tree. The choice is very clear. Read oriented, B-tree; write oriented, LSM. But customers, you know, applications are not built that way. They simply wanna give service to the customers, and most of the customers are doing mixed workloads.
And we thought that we want to enable those customers to have large capacity, multiple options, multi threading, being able to read and write fast. We needed to choose what we're working on first. We said, okay, let's improve the write amplification. When we did that, we saw that we were hurting the space amplification and that we were even making the reads worse, because we're working in multiple dimensions, which makes the read more complex. So we needed to design a new index that would help us search, a new bloom filter, and new partitioning within the LSM so we can actually access stuff easily.
All these technologies that we've developed, some of them were very frustrating till we found the right path. You could actually have a win-win, and it's not easy, I can tell you. Even when things worked perfectly in the lab, with customers we had some more challenges. And through all this, you gotta make sure that you're still aligned with RocksDB so you can actually help customers who are using RocksDB, so you can still be a drop-in replacement. So the challenge of maintaining backwards compatibility with RocksDB while doing all these things was pretty challenging.
So we pretty much touched everything from the memory allocation, bloom filters, indexes, write flow management, whatever you could think of, we've touched it, and there's still a lot more to do.
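For context on the Bloom filter redesign mentioned here, the standard Bloom filter relationship between memory and accuracy (general theory, not a description of SpeeDB's specific filters) is:

$$p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\text{opt}} = \frac{m}{n}\ln 2 \;\Rightarrow\; p \approx 0.6185^{\,m/n}$$

where m/n is bits per key and k is the number of hash functions. At the customary 10 bits per key this gives a point-lookup false-positive rate of roughly 1%, and each additional bit per key multiplies it by about 0.62, so filter memory is one of the levers that gets pulled when reads have to span more sorted runs.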
[00:24:43] Unknown:
As you were going through these optimizations and dealing with the compatibility guarantees and dealing with running into customer use cases, I'm wondering what are some of the trade offs that you made in the name of speed and scalability that you deemed acceptable in the process of optimizing for those use cases and potentially introducing reduced capacity in other dimensions for people who are using RocksDB, the vanilla version?
[00:25:14] Unknown:
As Israeli founders, we thought we'd rather be rich and healthy. We wanna read as fast as a B-tree and write 10 times faster than an LSM tree, you know, and we're smart enough to do that. And in the PowerPoint, it was actually perfect. When we reached the lab, we found that the laws of physics are not really aligned with the PowerPoint. We saw that the work we'd done improved the writes by 10 fold, but we really hurt the reads. So we had to do some new designs. We managed to do that pretty well, and we're now reading very, very fast, much faster than RocksDB. Not yet like a B-tree, but we're getting there.
The only thing we couldn't solve till today: if you have small data sizes, if your data is very small, you will pay the price of our robust system. Right? So if your data size is pretty small, if it fits into memory or slightly over, then you will either get no value or little value from our system. And since RocksDB is free, then you'd better use the vanilla version, unless you need support and services, you know, stuff that Facebook will not give you. So if you're a customer with very, very small capacities, the price we're paying for being very robust and very scalable at huge sizes makes us a bit complex in the small data sizes. Right? You're getting tons of possibilities.
So small data sizes under, I guess, 10 or 15 gig, we haven't managed to really prove our superiority. Lucky for us, the world is going that way. So the data is growing, and if you're using 10 or 15 gig today, you'll use more than 100 gig in 3 to 6 months, and you'll need us. We are a storage engine, or a data engine, because we're dealing with data now, not only metadata, for large capacities.
[00:27:10] Unknown:
In terms of actually adopting SpeeDB and integrating it into a platform, I'm wondering what that process looks like and how you engage with customers to identify that you are actually going to provide an improvement over their current situation, and then actually swapping out the RocksDB layer from whatever higher level framework they're deploying?
[00:27:35] Unknown:
So I will start with the number one challenge. Most of our customers are using RocksDB, and they do not know it. I've spoken to C-level, you know, guys, saying we're doing things much better than RocksDB, and they were like, what is RocksDB? Only when we spoke to the VP of R&D or the chief architect or the developers, they said, ah, RocksDB. Great. Wanna try it? So one of our challenges is to get to the right people, and I'll elaborate more about our plans for going open source because we found out that this would be the right way to approach these people, so we're now working on our open source version.
So this is one challenge. Second is that we chose pretty early on to be backwards compatible, or 100% compatible, with RocksDB, because our understanding of developers and customers is that they don't wanna work in the storage engine. They don't wanna do changes in their code, and they don't wanna work for you. They wanna work for themselves. So we made sure it's a drop-in replacement. Today, if you're a RocksDB user, in 30 seconds you'll replace the library, and you'll start working. We'll open the RocksDB database, and you can rock and roll. So there is no change whatsoever from your side, not in the API and not in the code. So it's a simple drop-in replacement, and we have customers actually doing it every day.
It forced us to make sure we do everything in the back end, within the storage engine, and not outside. So it took us some time to solve, you know, things without changing the API.
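To make the drop-in claim concrete, here is a minimal sketch of application code written against the standard RocksDB C++ API; the path and keys are made-up examples. Per the interview, an API-compatible engine is swapped in at build or link time, so code like this is left untouched, and the existing on-disk database is simply opened and reorganized in the background.

```c++
#include <rocksdb/db.h>

#include <iostream>
#include <string>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Open an existing RocksDB-format data directory (or create a new one).
  rocksdb::DB* db = nullptr;
  rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/demo-db", &db);
  if (!status.ok()) {
    std::cerr << "open failed: " << status.ToString() << std::endl;
    return 1;
  }

  // The key/value protocol described in the interview: put a key and a
  // value, then ask for the key and get the value back.
  db->Put(rocksdb::WriteOptions(), "user:42", "adi");

  std::string value;
  status = db->Get(rocksdb::ReadOptions(), "user:42", &value);
  if (status.ok()) {
    std::cout << "user:42 -> " << value << std::endl;
  }

  delete db;
  return 0;
}
```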
[00:29:07] Unknown:
On that note of customer education, you mentioned that it's often difficult for you to help the customer understand that they're even running RocksDB and that this is the proper solution. And I'm curious how you have been approaching that aspect of raising awareness of this as a potential bottleneck for these systems, understanding how to find customers when they are running into these problems, and just helping them understand the potential ramifications and positive benefits of replacing RocksDB?
[00:29:40] Unknown:
Educating customers is a bad thing. If you wanna educate customers, you're probably not in the right business. Customers don't wanna be educated. Lucky for us, the storage engine is becoming a problem not because of awareness, but because it is there. And the more data you have and the more things you do with the data, the more the storage engine becomes a bottleneck. When you have problems, they go to the levels above. And at some point, someone very senior will say, why are our customers not happy? Why do they leave us? Why do they have stalls? Why is performance deteriorating?
So the problems are there. The technical people know about them. If you are a developer writing a data stack, you know about RocksDB, and you know that you're using it, and you know the bottlenecks because you're managing it. Our challenge was how do we approach these guys, who are not really media guys. They're not there on, you know, LinkedIn and Facebook. They're in user groups and developer forums, and you really need to come to them there, and not with marketing pitches like we're 10 times faster, 10 times bigger. You need to really come to them with data, with benchmarks, and approach them on a technical level, which led us to the conclusion that if we wanna build a user community and raise awareness among these guys, the only way forward is to come with an open core offering that will actually give them an open source version that they could use to see the benefit, and give them an enterprise stack to work with us in, like, production. This decision has been taken.
Now we are working on our first open core or open source version that will soon come out, that hopefully will build a huge community of RocksDB users and will give them both the product and the technology they need to scale and a company to support them.
[00:31:31] Unknown:
Once you have reached a customer and they see the benefits of SpeeDB, and they want to move forward with getting it integrated into their stack, is there any data migration that's required to be able to actually make that switch, or is it something where you can just replace a dependency in the build of the database engine and SpeeDB is able to actually interact bit for bit with all the data that's already laid out on disk so that you don't have to do any sort of big effort to make that migration?
[00:32:01] Unknown:
Great question, because this could have been an issue. But early on, we decided to stay 100% compatible with RocksDB, which means that the API is identical. So the thing about key value is that the protocol is pretty simple. You put a key, you put a value, and then you call for a key and you get a value. So the protocol is pretty simple, but the layout on the media is different, but that's internal. So we've designed our system so that we can actually open every RocksDB database. So from a customer point of view, it's a drop-in replacement. So there's no data migration, but we do the changes within the storage engine as we go.
So your data that resides in RocksDB, when you change to SpeeDB, we're taking it into SpeeDB, and the data layout is changing in the back end as you're working. So from a customer point of view, it's a drop-in replacement with no data migration. From our point of view, we're doing the work behind the scenes.
[00:33:02] Unknown:
As far as the effort of producing an open source version of SpeeDB, one of the challenges that always comes up there is that when you're first iterating on the code, you have the assumption that this is internal. We can make some shortcuts, or we can use some internal idioms in how it's written. And I'm wondering what you're seeing as some of the work that you need to do as you go through and maybe clean up some of the structure or clean up some of the semantics of the code to make it publicly consumable and something that you feel happy with releasing to the
[00:33:36] Unknown:
community? None of us was an open source guy before SpeeDB. And people asked me about a year ago when we founded the company, how about going open source? And I was like, no. That's an oxymoron, business and open source. How can you make money from giving things away for free? At some point, we realized that open source is the right way, because you can sell to enterprises, but if you wanna conquer the world, you need the developers with you. So open source is definitely the right way. And then I said to the guys, okay, let's go open source. Let's put our code on GitHub, you know, raise a community.
And then people stopped me and said, hey, hey, you need to do some homework. We started doing the homework. We hired some very smart consultants, people who have done this before. And then I realized that it's not 2 weeks of work. It's 2 quarters of work at least. And the code that we've written in our labs, which is perfectly readable to our developers, is not really open source grade, and you really need to make changes, from the code to the behavior to the transparency. Right? It's like taking a restaurant with a closed kitchen and putting in windows to the kitchen. Right?
We are doing this work right now, and it means changing the mindset, doing things differently, hiring people for things like community manager and developer advocates. We need to change our website. If you come to our website, it's very sexy, very cool for enterprise. If you are a developer, you're looking at it and saying, okay, where is the real stuff? Right? So there's a lot of work we need to do behind the scenes. I wish it wasn't like this, but we're doing it. We wanna do it right. So it might take another month or 2, but we wanna make sure we come out right with the open source, because we really wanna build a healthy, vibrant community and give them real value.
[00:35:30] Unknown:
Yeah. It's definitely always interesting getting people's perspectives, particularly from somebody like you who doesn't have a long background in open source and coming to this community and the ways of working because it can be very different. So it's always interesting seeing people go through that shift.
[00:35:47] Unknown:
Yeah. I've actually gone through this myself. I can describe the transition from disbelief to maybe this is the right way, to understanding that probably this is the way, then coming to the conclusion that this is the right way, and then asking the question, okay, how do I do that? And then saying, okay, let's do that. But then you dive deep into the pool and say, wow, this is a huge thing. It's not just putting your code on GitHub. It's actually building it right. Right? You know, you need to change,
[00:36:19] Unknown:
so we're doing it. Well, definitely happy to hear that. Happy to help you broadcast that so that folks can keep an eye on your work there. And in terms of the work that you have been doing and the ways that your customers have been employing SpeeDB, I'm curious, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:36:38] Unknown:
Our obvious go to market strategy was RocksDB replacement. Right? If you're a user, if you're scaling, you have issues. And then it wasn't really surprising to see customers saying, yeah, we're suffering, and we would like to use SpeeDB, and they're willing to pay for it because it really solves a big problem. What we didn't realize is that some customers wanted to work with a storage engine, but their system was not really built to work with an embedded storage engine. And some of them actually approached us saying, how do we embed SpeeDB into a system that is not RocksDB compliant?
And we said, sorry, we can't help you, and they said, no, no, we are willing to do changes in the API, or help you, work with us to embed SpeeDB into our code, because we think it will yield better results. So one of the rising cloud providers providing S3 services actually cut a deal with us. It's still not public, but they will run the whole metadata of the S3 service on SpeeDB. So they're actually moving it from a general purpose file system to an embedded storage engine key value store, which provides much bigger scalability and performance and resource utilization.
And it was pretty surprising from our point of view to see that customers are realizing that they needed to do this change and the world is actually moving towards the key value store.
[00:38:11] Unknown:
TimescaleDB, from your friends at Timescale, is the leading open source relational database with support for time series data. Time series data is time stamped so you can measure how a system is changing. Time series data is relentless and requires a database like TimescaleDB with speed and petabyte scale. Understand the past, monitor the present, and predict the future. That's Timescale. Visit them today at dataengineeringpodcast.com/timescale. In your experience of building the technology and the business around SpeeDB, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:51] Unknown:
We've learned that you can think whatever you want and test it, but till you reach the market and speak to those customers and really work in their environments, you don't really know what to expect. I think that predominantly that's what led us to decide on open source, not only as a go to market, but understanding that when we are open source, the customers will actually use you and give you feedback about how it's working. You won't be under the impression that it's working perfectly in the lab, but you'll actually know that it's also working well for customers.
[00:39:28] Unknown:
We addressed this a little bit, but for people who are interested in being able to scale up the speed and volume of data that they're pushing into something like RocksDB, what are the cases where SpeeDB is the wrong choice, and maybe they're better suited with the work that's been done at Terark or using the Badger implementation that underlies Dgraph?
[00:39:49] Unknown:
I think I said that in small capacities of data, we don't see that we bring value today. If your data size is less than 5 or 10 gig, you're probably okay with vanilla RocksDB. If you have specific use cases, like super large objects like Terark, I don't think we've done the necessary changes for us to be better than Terark. We may help them with some use cases, but I think they've done some very good work in customizing it to their own needs. But I think that the 90% out there that are really struggling with the rising amount of objects and capacity and multi threading and mixed workloads, they will greatly benefit from SpeeDB.
And not to forget, we realized that many of these customers, apart from the technology, need a place to ask for features, to get support, to get help with customization and vertical alignment with their application, which is something that I don't think any company does today. So when you get SpeeDB, you don't only get the technology and the benefits, but you get a company that will really be there for you when you need it. And working in the data space, you always need help with doing stuff.
[00:41:07] Unknown:
One of the things, too, that I'm interested in your thoughts on is that RocksDB unlocked a lot of potential and value for all these platforms that were built on top of it. And now that you have SpeeDB as a technology layer that's available for other people to build on top of, what do you see as some of the potential future use cases or unrealized applications that folks might start to build on it now that there is this increase in speed and scale that's available to them?
[00:41:37] Unknown:
This is a long term vision. Right? But nowadays, if you write an application, you write your application, and you know that you will need to connect to an underlying database. But that's because you've been led to understand that an application needs to work with a fully managed database. What if your application simply needs a key value store to manage data? What if you don't need a full blown data management system? Think of services like DynamoDB. They realized it, and they give you as a developer an option to really connect to DynamoDB and get a key value store, or key value services, for your needs. The problem is that they're huge and they're pretty limited. But think if you had a platform that would be a key value store as a service on the cloud, where you can actually write your application and get a key value store, not a fully managed database with all the dependencies and all the heavyweight, but a key value store to help you manage the metadata and the data itself with the local media.
That would actually change the mindset of developers, being able to write an application and simply get the key value store as a service. So we think that once we are able to spread SpeeDB out there and build a huge community, then the next step would be to build a cloud platform. We'll call it key value store as a service, in which we will allow every developer to start with our open source product and then move to our managed service that will give them key value stores as a service. I think that's the long term vision of where we wanna be.
[00:43:18] Unknown:
Are there any other aspects of the work that you're doing on SpeeDB, or the overall space of RocksDB and key value stores, or anything in terms of the scalability of speed and volume of data that SpeeDB enables, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:43:36] Unknown:
We think that there's a lot of work to be done around storage engines or data engines. So performance is one thing. Scalability is another thing. Resource utilization is another thing. Vertical alignment to the application is also another dimension that we still have to work on. I think that what we're trying to do is we're trying to open those bottlenecks that have been created in the past decade and allow those applications to actually do more with the resources they have today. One of the benefits that the cloud platforms or the cloud providers brought us is the ability to do things very fast. You can write your code. You can get infrastructure as a service, spin up servers, and you have everything very handy and very quick. So the time to market is very, very good, but no one thought about the cost of this.
We think that by opening bottlenecks in the storage engine, we will allow customers to actually do much more with much less, which we think is a very important thing. Once we've done that, we think we'll be able to offer them a platform to get a service that they can do whatever they want on much less hardware.
[00:44:46] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and eventually contribute to the open source work that you're doing. I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:06] Unknown:
So I think that the database management systems have made things very easy for developers to get what they need. So in terms of usage, it's pretty simple, and everyone is now moving to the SaaS world, so you can get those services on the cloud, which is great. I think the only thing they haven't really addressed is the costing. So things are very easy, but very costly, because they're working very inefficiently in terms of the data usage. And I think that's the problem we're trying to solve.
[00:45:39] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at SpeeDB and your insights on the lower level layers of the data storage stack and some of the optimizations that are available to be made there. So appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you very much. It was a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Adi Gelvan: Introduction and Background
The Genesis of SpeeDB
Challenges with RocksDB and Large Datasets
Industries and Use Cases for SpeeDB
Scaling Issues and Solutions with SpeeDB
Comparisons with Other Storage Engines
Trade-offs and Optimizations in SpeeDB
Customer Adoption and Migration to SpeeDB
Open Source Strategy and Community Building
Interesting Customer Use Cases
Lessons Learned and Future Vision
Closing Thoughts and Contact Information