Summary
The most expensive part of working with massive data sets is the work of retrieving and processing the files that contain the raw information. FeatureBase (formerly Pilosa) avoids that overhead by converting the data into bitmaps. In this episode Matt Jaffee explains how to model your data as bitmaps and the benefits that this representation provides for fast aggregate computation. He also discusses the improvements that have been incorporated into FeatureBase to simplify integration with the rest of your data stack, and the SQL interface that was added to make working with the product easier.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I’m interviewing Matt Jaffee about FeatureBase (formerly known as Pilosa and Molecula), a real-time analytical database engine built on bitmaps
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what FeatureBase is?
- What are the use cases that it is designed and optimized for?
- What are some applications or analyses that are uniquely suited to FeatureBase’s capabilities?
- What are the notable changes/evolutions that it has gone through in recent years?
- What are the forces in the broader data ecosystem that have had the greatest impact on your project/product focus?
- What are the data modeling concepts that platform and data engineers need to consider when working with FeatureBase?
- With bitmaps as the core data structure, what is involved in translating existing data into bitmaps?
- How does schema evolution translate to the data representation used in FeatureBase?
- How does the data model influence considerations around security policies and governance?
- What are the most interesting, innovative, or unexpected ways that you have seen FeatureBase used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on FeatureBase?
- When is FeatureBase the wrong choice?
- What do you have planned for the future of FeatureBase?
Contact Info
- jaffee on GitHub
- @mattjaffee on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Your host is Tobias Macey. And today, I'm interviewing Matt Jaffee about FeatureBase, formerly known as Pilosa and Molecula, a real-time analytical database engine built on bitmaps. So, Matt, can you start by introducing yourself?
[00:01:56] Unknown:
Matt Jaffee. I've been working at FeatureBase since it was Molecula, and before that when it was Pilosa, and before that at the company we spun out of, which was called Umbel, back in 2015. Prior to that, I was a government contractor for about a year and a half after I got out of graduate school, where I studied computer science with a focus on distributed systems and networking.
[00:02:20] Unknown:
And do you remember how you first got started working in data?
[00:02:24] Unknown:
Yeah. So I guess if you're in computer science and you're in software these days, you're gonna crash into data pretty quickly. I was in grad school. I was thinking about getting a PhD. I really just had an itch to write a lot of software and was kind of frustrated just writing papers. So I ended up taking a job at a government contractor, where I got to do a wide variety of stuff, actually, everything from building GUI programs in Python to inspecting network traffic on a GPU and some stuff in between. But, ultimately, I wanted to live someplace different and wanted to kinda have a different workplace culture, and that's when I found Umbel in Austin, Texas.
So I joined Umbel, which was kind of a SaaS platform for marketing teams to understand their audience better, you know, be able to collect data and figure out who their audience is and how to market to them. And that's kinda all I thought it was when I joined, and I think it was about 2 weeks in, after being there, that I was, you know, talking to the chief architect, and he was telling me about this, essentially, this distributed database that they built internally to handle some of the more difficult queries. And I mean, this thing was wild. Like, it was written in Go, which was very new at the time. You know, it was a distributed system, and it was storing data in this really strange way where everything was encoded as a bitmap.
And so I, you know, I was just taken with this. I was like, I want to work on this thing. Like, when can I start? And about a year and a half later, we had the opportunity to spin out a company just around that piece of the infrastructure. So that was Pilosa. That was really exciting. And basically, ever since then, I've, you know, just been developing a database. So learning a lot about databases and the data engineering that, you know, goes into getting data to end up in the database and, you know, how to make the database behave the way you need it to for what you wanna do. You know, it's all connected.
[00:04:25] Unknown:
And so in terms of the FeatureBase project and product itself, I've done interviews previously about the Pilosa database engine before it became FeatureBase and about Molecula, the business. But for people who haven't listened to those, I'm wondering if you can just give a quick overview about what you're building at FeatureBase and some of the use cases that it is designed and optimized for.
[00:04:50] Unknown:
We've had a long journey. So basically, the core tech has never really changed. It's a database that is built on a pretty unique storage format, in that it uses bitmap indexes as the primary data representation, you know, on disk and in memory. And everything is sort of built around that. You know, the query optimizer, the query planner, the execution engine is all built with bitmaps in mind to take maximum advantage of them. And there's a lot of sort of understood, you know, truisms about bitmaps and bitmap indexes, if you're into databases, that don't actually hold with some of the modern approaches to bitmap compression and some other techniques that have, you know, come around even in just the last decade or so.
We've really found that the use cases for these things have exploded, so much so that we've gone from a very specialized tool that was built to handle a certain part of the query workload in a certain product, to, we just have a relational database, you know, built on top of bitmaps. And we're constantly expanding functionality around that, where you can do all kinds of SQL queries and basically do whatever kind of query you need to do, but very much with a focus on analytics. So there's so many dimensions that you can look at databases on, but I think the main spectrum is just from transactional workloads to analytical workloads, and FeatureBase is dialed all the way to the analytical side. I think because it's built entirely on bitmaps, the trade-offs are more in favor of analytics than any other database out there, which does mean that the transactional workloads suffer. You know, we're not getting something for nothing, but the analytical workloads can be incredibly efficient.
[00:06:50] Unknown:
As far as the capabilities of FeatureBase, because of the fact that it is operating on these bitmap indexes and has these efficiency gains as a result, I'm curious if there are any types of analysis that it uniquely enables, or any ability to eliminate components of architecture that would otherwise be necessary in a more, quote unquote, traditional database engine.
[00:07:23] Unknown:
When you look at analytical workloads generally, there are definitely some that FeatureBase is better at than others. And, well, let me start by telling you kind of the original use case that Pilosa was developed for, you know, way back when, and then we can kind of expand from there. So back at Umbel, we had this problem where one of our main queries that people wanted to do was, show me the top interests among people in my audience. So maybe it's the most liked Facebook pages among people in my audience. That's a sorted group by, essentially. That by itself is okay.
But if you wanna do that same sort of group by on any subsegment of your audience, kind of in real time, where you're saying, I just wanna know the top Facebook likes among, you know, females from Massachusetts who like the NBA, or any complicated where clause you can think of, and then get that top list back. If you've got, you know, hundreds of millions of consumers potentially and, you know, the universe of Facebook likes is in the tens of millions or more, that becomes a very challenging query. And we were using Elasticsearch at the time, which, you know, back in 2013, 2014, Elasticsearch was not what it is today, and our 20 node cluster was just falling over, or, you know, the garbage collector would run and the queries would take 20 seconds or whatever.
So that was the original use case for Pilosa at the time: be able to do these sorted group bys on very specific segments of data in real time, because the whole thing had to power a web app, you know, that people were just clicking around in. And some of those page loads in that web app, like, you would define a segment in the app and then it would do these top N queries on a whole bunch of different columns in the data all at once. So every page load might be doing, you know, 20 of these queries. And so it just had to be way more efficient on the back end for us to have a business. Right? That's where it was kinda born from.
You can kinda start imagining from that type of segmentation workload. It doesn't have to be marketing. It doesn't have to be advertising. There's lots of things that wanna do this. Anything from, you know, like, high energy physics to, you know, intrusion detection systems wanna do things like this. That's really the sweet spot: where you have complicated where clauses and you wanna do sorted group bys, or really anything that a columnar database would be good at. And then in particular, anything that wants to look at specific values within a column, because that's where FeatureBase really shines. Instead of having to scan an entire column, because everything is split out into bitmaps, you can actually just scan the data that's about a particular distinct value without having to scan the rest of the column. It's really like you get the benefits of a columnar database plus the benefit of being able to sort of address things down to distinct values.
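To make the shape of that workload concrete, here is a minimal sketch of how a segmentation filter plus a top-N group by can be answered with nothing but bitmap intersections. It is not FeatureBase's actual implementation; plain Python sets stand in for compressed bitmaps, and the field names and record IDs are made up:

```python
# Toy bitmap index: every distinct value of every field maps to the set of
# record IDs (bit positions) that have that value. Real engines use compressed
# bitmaps (e.g. Roaring), but Python sets show the same query logic.
index = {
    ("state", "MA"): {1, 2, 5, 9},
    ("gender", "F"): {2, 3, 5, 7, 9},
    ("likes", "NBA"): {2, 5, 8, 9},
    ("likes", "Lego"): {1, 2, 5},
    ("likes", "Skiing"): {3, 5, 9},
}

def top_n(filters, group_field, n=10):
    """Intersect the filter bitmaps, then rank values of group_field by overlap."""
    segment = set.intersection(*(index[f] for f in filters))
    counts = {
        value: len(bits & segment)
        for (field, value), bits in index.items()
        if field == group_field
    }
    return sorted(counts.items(), key=lambda kv: -kv[1])[:n]

# "Top likes among females from Massachusetts who like the NBA"
print(top_n([("state", "MA"), ("gender", "F"), ("likes", "NBA")], "likes"))
```

The point of the sketch is that the arbitrary where clause never touches raw rows: it is just a chain of intersections over pre-built bitmaps, and the group by is a population count per value.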
[00:10:21] Unknown:
Because of that combination of benefits, I'm curious if there are any approaches that have been built up around the capabilities of FeatureBase in terms of rethinking the way that you structure the data that you're storing, or ways that you think about the query patterns or the ways that you structure your analysis because of those efficiency gains. Whereas you may, you know, optimize in a different direction if you're going for a columnar store or if you're, you know, trying to optimize for a traditional OLAP engine.
[00:10:55] Unknown:
We're really trying to put a SQL relational interface on top of this thing so you don't have to think about it too much. That said, I think the biggest departure from your typical relational database or columnar store are set fields. And I'll explain what I mean by that. Because of the way a bitmap index works, where, you know, one hobby might be skiing, that's a unique value in your column. For each unique value, you have a bitmap, and each set bit represents a particular person that is into skiing. That's their hobby. But because each distinct value is represented separately as a separate bitmap, it's very, very easy and natural to have a set of values associated with a particular person, because everyone has a certain position in the bitmap that's associated with them, a certain index into the bitmap.
And so every single one of these bitmaps, you know, for skiing or Lego building or whatever your hobby is, you have a position in that bitmap. And if that position is set to 1, you are associated with that hobby. So there's no need to, like, have a special field type to represent where, you know, one person has multiple hobbies. There's no need to have a many to many relationship or a join table or whatever. You just set the bit in the bitmap of the hobby the person is associated with. So that is very, very powerful and much, much more efficient than trying to do a traditional, like, many to many type relation, or having a special field that can store multiple values and having to have, like, you know, a fixed size memory slot available or be resizable or whatever.
You get this very natural extension to these set types that allows you to very, very efficiently represent things like Facebook likes, or, you know, anytime you have, what domains has a certain IP address accessed, or what behaviors have you observed from a particular particle, you know, with the data coming out of your particle accelerator. All kinds of things like this get represented a lot more efficiently. Whether you expose it that way through the relational model, through the API, like, you don't necessarily need to, but under the hood, you can represent it that way. You get a massive amount of compression and a massive reduction in computational costs when you're computing where clauses and aggregates and that sort of thing.
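A rough illustration of the set-field idea described above, again with plain Python sets standing in for bitmaps rather than FeatureBase's on-disk format: giving one person several hobbies just means setting their position in several value bitmaps, with no join table in sight.

```python
from collections import defaultdict

# One bitmap (here, a set of record positions) per distinct value.
hobby_bitmaps = defaultdict(set)

def add_hobbies(person_id, hobbies):
    # A "set field": each hobby simply sets this person's bit in that hobby's bitmap.
    for hobby in hobbies:
        hobby_bitmaps[hobby].add(person_id)

add_hobbies(0, ["skiing", "lego"])
add_hobbies(1, ["skiing"])
add_hobbies(2, ["lego", "climbing"])

# Who has both hobbies? One bitmap intersection, no many-to-many join.
print(hobby_bitmaps["skiing"] & hobby_bitmaps["lego"])   # {0}
# How many people ski? One population count.
print(len(hobby_bitmaps["skiing"]))                       # 2
```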
[00:13:32] Unknown:
From hearing you talk about things like Facebook likes and particles, I'm also curious about some of the data modeling approaches that it can support, where with columnar data, it's generally trying to do aggregate analysis on these various attributes. Row oriented is generally optimized for relational. But from what you're discussing, it also seems like you could potentially even start to branch out into the graph domain and being able to do some network queries of the ways that some of these different attributes are connected to different entities.
[00:14:05] Unknown:
Yeah. That's a really interesting point. Early on, we thought a lot about graph use cases, because when you think about it, you've got a bitmap. You know, it's a sequence of bits. And really, for each value, you have a different bitmap. You can think of it as a bunch of bitmaps stacked on top of each other. Alright. Well, that's a bit matrix. What's one way to represent graphs? An adjacency matrix. So you have this very efficient and scalable way to represent a graph of almost arbitrary size and to compute on it, because everything is stored as compressed bitmaps.
You know, if you want to compare, you know, the connections of one thing to the connections of other things, you can do that very efficiently. So I do think there's a lot of interesting possibility there. And I will say that one of our perennial problems has been focus, and I don't think we can necessarily go heavily down the graph path right now, when we're kind of focused on, you know, just being a reasonable, like, looks like a relational database, really good at analytics, and stores things as bitmaps under the hood. And I think if we nail that, and we nail the, like, cloud version and the serverless component, which I hope we get time to talk about later, people are gonna be able to build all kinds of crazy things on top of this that do things that we hadn't even considered.
So we've thought about the graph thing a lot. We're not focused on that path right now, but I do think the underlying representation has a lot of promise in that department.
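To illustrate the adjacency-matrix analogy Matt sketches here (a hypothetical sketch only; FeatureBase does not ship a graph API), one bitmap of neighbor IDs per node is exactly a row of an adjacency matrix, and comparing two nodes' connections is just set algebra:

```python
# One "neighbors" bitmap per node = one row of the adjacency matrix.
edges = {
    0: {1, 2, 3},   # node 0 points at nodes 1, 2, 3
    1: {2, 3, 4},
    2: {3},
}

# Common neighbors of nodes 0 and 1: a single bitmap intersection.
print(edges[0] & edges[1])                                   # {2, 3}

# Similarity of two nodes' connection sets (Jaccard), still just set ops.
print(len(edges[0] & edges[1]) / len(edges[0] | edges[1]))   # 0.5
```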
[00:15:37] Unknown:
Keying off of what you were just saying about the cloud service and the serverless approach, I'm wondering if you can give an overview about some of the notable changes and some of the evolutions that the FeatureBase project has gone through in recent years, I guess specifically in the past 3 years since I first talked about the engine and the 2 years since I talked about the product side of it?
[00:16:00] Unknown:
Yeah. So when we originally launched, it was open source, and we developed in open source and built up something of a small community. And that was really cool, but it was hard to get paying customers. And after a while, investors get, you know, a little nervous when you don't have, you know, lots and lots of paying customers. We kind of decided, like, alright, we need to pull back focus from open source, because that actually takes a lot of time and resources to do. Right? And, you know, you can't make as many breaking changes and you have to manage things very carefully. So we decided, you know, if we go closed source and deliver this as, like, on-prem software, we can move a lot faster for a while. And that's what we did. And we were actually fairly successful in doing that, and we got some pretty large enterprise customers. And then, you know, come, I don't know, 6 months to a year ago, we were like, okay, I think, you know, it's time for 2 things, basically. First of all, we need to have a cloud product, and we had kind of been planning that for a while. But, you know, there's no way that, as a sort of modern data business, you can not be in the cloud. Anything on-prem is declining, and, you know, in the usage of cloud products we see, you know, Snowflake obviously took off like crazy in the past few years.
That's gonna be the way of the future. We knew we needed to have a good cloud offering, and we knew that it's pretty hard to get people to trust a database that isn't open source. And it's really healthy for a database to be open source and have that community, you know, of people using it and finding problems, helping fix them, finding new use cases. So we knew we wanted to focus on those 2 things. And so we hired some expertise, you know, folks that had built cloud products before, especially, you know, cloud data products, to help us with that. And we hired some folks to help us, you know, build a community and manage the open source side of things.
And we re-open sourced FeatureBase with everything that we developed in the past few years, which was actually quite a lot. We totally rewrote the storage engine and built it off of B-trees to basically be a lot more scalable, fix a lot of the issues we'd found in the original storage engine, and have transactional guarantees at the shard level, really good ACID guarantees. So a lot went into that. Now the focus is cloud. And now that we have our sort of base cloud product in place, which is basically just: we run a FeatureBase cluster for you in the cloud. Now we're focused on the next step, which is serverless.
[00:18:47] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations.
Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. Before we dig too much into the serverless aspect, as far as the broader data ecosystem, that has also been going through a lot of changes, and those changes seem to just be accelerating, especially in the past couple of years.
And I'm curious which of the major kind of motivating forces in the data ecosystem and the cloud and technology landscape have been most influential in the ways that you think about the focus and the scope of the product that you're offering?
[00:20:23] Unknown:
Probably the most influential thing has just been kinda watching the rise of these large language models and just all the kind of AI stuff coming to a head, I think. You know, for a long time, you know, it's kinda been the joke among engineers that, you know, all of AI boils down to, like, linear regression a lot of the time. You know? I think we're starting to get to a tipping point where these large language models are doing some pretty cool things, and I can only imagine that's gonna accelerate. Like, I think this stuff is the real deal. We want to be positioned to be able to help power that, because it's all built off of huge datasets, and being able to analyze them and iterate on them and, you know, find interesting clusters and all kinds of things like that.
And I think because of the way that FeatureBase naturally represents things, it's basically doing a categorical mapping of all values into numbers as a consequence of how we have to ingest data to map it into a bit matrix. Every value gets mapped to a number, which is exactly what you want to do when you want to do a neural network or any kind of machine learning on data. Right? You turn strings into numbers before you do anything else. And then we represent all of the relationships in the data in the most compact way that I can imagine. Right? As compressed bitmaps. So you have a single bit representing a relationship.
And then furthermore, when, you know, you shove a bunch of these together, you compress them as much as possible. But you compress them in a way that is computable. You're not just, like, you know, running it through a general-purpose compressor. You're not just, like, zipping it. Right? You're compressing it in a way that you can still compute on. And the underlying technology there is called Roaring bitmaps, if anybody wants to take a look at it. But basically, you look at the data and there's 3 different encoding types based on the density of the data, and every operation is defined on every pair of those encoding types. So you're never decompressing the data to operate on it. You operate on it in place. And that is totally transformative, because typically you think of compression as a trade-off of, like, compute resources for space, but it's not a trade-off anymore. It's just, you make it smaller. And because it's smaller, it's faster to operate on, because you're never decompressing it.
The trade-off now is implementation complexity. The implementation is more complex, but I think that's just the way of the future. Like, software gets more and more complex to make everything better. Right? Like, I mean, that's just how the world works now. It's really, really interesting. But to kind of circle back to the question, I got a little excited there. Being able to support, you know, the future, which is largely gonna be driven by machine learning and AI, you know, being fed huge amounts of data and being fed the right data, being fed clean data, you know, I think being able to analyze that data and get it prepared is just crucial. And so putting the tools out there to do that in a way that is sort of infinitely scalable is basically what it's all about now.
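A rough sketch of the container idea behind Roaring bitmaps that Matt describes, heavily simplified: the real format splits the value space into 2^16-value chunks and also has run-length containers, but the core trick is that sparse chunks live as sorted arrays, dense chunks as bitsets, and operations are defined for every pair of container types so nothing is ever decompressed first.

```python
# Simplified Roaring-style containers: sparse data as sorted arrays of values,
# dense data as a fixed-size bitset. Intersection is defined per pair of
# container types, so the data is operated on "in place", never expanded.
ARRAY_LIMIT = 4096  # Roaring's cutoff between array and bitmap containers

def make_container(values):
    values = sorted(set(values))
    if len(values) <= ARRAY_LIMIT:
        return ("array", values)
    bits = 0
    for v in values:
        bits |= 1 << v
    return ("bitmap", bits)

def intersect(a, b):
    (ka, da), (kb, db) = a, b
    if ka == "array" and kb == "array":
        sb = set(db)
        return ("array", [v for v in da if v in sb])
    if ka == "array":                        # array ∩ bitmap: probe the bitset
        return ("array", [v for v in da if (db >> v) & 1])
    if kb == "array":
        return intersect(b, a)
    return ("bitmap", da & db)               # bitmap ∩ bitmap: bitwise AND

sparse = make_container([3, 17, 4000])
dense = make_container(range(0, 10000, 2))
print(intersect(sparse, dense))              # ('array', [4000])
```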
[00:23:46] Unknown:
Hearing you talk about the application of FeatureBase in the machine learning ecosystem, and the fact that all of the data is already represented numerically, also puts me in mind of another trend in the database market of vector databases. And I'm wondering what you see as the comparable capabilities of FeatureBase as compared to things like Pinecone or Milvus.
[00:24:12] Unknown:
I'll be honest. I haven't delved into vector databases deeply. I think that they are sort of fundamentally transposed from what FeatureBase is doing, and I'll explain what I mean by that. I may be wrong about this, but my sense is that a vector database is taking an entity, a record essentially, and representing that entity as a vector, you know, as a vector of bits or a vector of floats or whatever. FeatureBase is actually taking every individual value from every column that represents that entity. So if your entity is a person, you know, you might have column age, column name, you know, column hobbies, whatever.
FeatureBase takes all the individual values out of those columns and represents each of them as a bitmap. Well, we do it a little bit differently if you have, like, numeric data. But that aside, a vector database is, I believe, still sort of focused on representing the entity as something, whereas we're focused on sort of breaking the entity down, scattering it across, you know, a whole bunch of different places in memory and on disk, because every distinct value of a column is sort of addressable. But to put a record back together, you have to kind of go look at all those different values and reconstruct things. And that's why, earlier on, I said it's a trade-off between transactional and analytical, because for transactional workloads, you're really interested in getting the whole record back. For analytical workloads, you're generally not. You're interested in answering aggregate questions about particular columns or particular values in the dataset.
[00:25:54] Unknown:
That's the fundamental trade-off. Yeah. It's definitely interesting. And digging deeper into that data modeling question, I'm wondering what you see as some of the core concepts that people who are using FeatureBase need to understand in order to be able to make proper use of it, and how they need to think about the specific attributes that they want to decompose into those bitmaps, and how to think about, you know, what are the aspects that we want to be joinable, what are the things that we only care about in isolation, and some of the ways to convert their existing, maybe relational data or, you know, semi structured or unstructured data into the representation that FeatureBase operates best on?
[00:26:39] Unknown:
I'll say again, like, our ultimate goal is you don't have to think about any of this. That's not entirely true today, though. It does behoove you to think about how things actually look, how things are represented under the hood. Now, for the most part, you can ingest your data as normal. Like, I mean, we're now supporting, like, bulk insert SQL statements where, you know, you can just jam, you know, a huge number of records in, and we take care of all the under the hood machinery to transform that into bitmaps and so on and so forth. But if you do think about how it ends up being represented under the hood, especially with regard to set fields, like, that's the main thing: if you have a relational schema where you have something that is tracking, you know, a set of values being associated with a particular record in a particular column, you definitely wanna use a set field for that. You don't wanna try to have a many to many relation and have a separate table.
And there's actually something built on top of that even, something called a time quantum field, that gives you that set functionality and also gives you the ability to associate a time stamp. Of course, the time stamp is coarse grained, it can be down to hourly at most, but it gives you the ability to associate a time stamp with each value that's associated with each record. So you can actually say, like, so and so listened to episode 17 of Tobias' podcast on a particular day or at a particular hour. And then they actually listened to that same episode again the next day, you know, at a different hour. And you can track that all within the same column, within the same table of your schema, from a storage perspective, without sort of going outside of that.
That can be really, really powerful, because you're adding that time component to the set field. You know, the set field sort of gives you that natural power to represent multiple values without any overhead. Now you can have multiple values, and each value has its own coarse grained time stamp associated with it. And you can query across those and say, you know, give me how many people listened to episode 17 of my show in February. You can ask questions like that and then operate on that set of people. So I think those are the things you need to think most about when you're, you know, taking a traditional relational schema or whatever you're doing in another database and you want to move it to FeatureBase.
So those set fields, and then the time quantum fields on top of that, are the most important things to understand. There is some support to do joins, and basically you can have 1 field represent a foreign key. And so we would store that as an integer. And we have a really interesting way of storing integers in bitmaps called bit-sliced indexing, which I won't try to explain right now. But basically, if you have integers of any range, like, you know, you can have 64-bit integers, we basically encode them in 64 different bitmaps. In the typical bitmap indexing, you have to have a bitmap for every unique value. But if you have 2 to the 64 bitmaps, like, you're gonna run out of memory. I'm sorry. So you can't do that. But what you can do is what's called bit-sliced indexing, where we break out each binary digit of the number and store it in a separate bitmap. And it turns out we can reconstruct arbitrary range queries across those in a pretty efficient way.
So that's how we hold, basically, the foreign key references. There's a special query in the storage engine that will take things that are stored in a bit-sliced index, and you can say, like, give me all the unique values in this field as a bitmap. So I'm taking the values that are stored in this column, in this bit-sliced index column, and getting them out as a bitmap. And that bitmap will be applicable to whatever table that foreign key references. So I can go use it as a filter on that table and union it and intersect it with the bitmaps of that table. That's how joins work under the hood, essentially, is through these foreign key references. So you do still have that capability, but what we most often find is that with set fields and time fields, you have to do a lot fewer joins. And that's where the real performance benefit comes from.
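A minimal sketch of the bit-sliced indexing idea Matt outlines (simplified, and not FeatureBase's actual code): one bitmap per binary digit of an integer column, with a range predicate answered by walking the slices from the most significant bit downward, in the spirit of the classic bit-sliced comparison algorithm.

```python
# Bit-sliced index: one bitmap per bit position of an integer column, instead
# of one bitmap per distinct value. Python sets stand in for bitmaps.
ages = {0: 25, 1: 41, 2: 17, 3: 41, 4: 63}
BITS = 6

# slices[i] = bitmap of records whose i-th bit is set
slices = [{rid for rid, v in ages.items() if (v >> i) & 1} for i in range(BITS)]
everyone = set(ages)

def greater_than(c):
    """Records whose value is strictly greater than c, using only the slices."""
    gt, eq = set(), set(everyone)
    for i in reversed(range(BITS)):
        if (c >> i) & 1:
            eq &= slices[i]             # must have this bit set to still be tied
        else:
            gt |= eq & slices[i]        # tied records with this bit set win now
            eq -= slices[i]
    return gt                           # use gt | eq for >=

print(sorted(greater_than(30)))         # records with age > 30 -> [1, 3, 4]
```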
[00:30:56] Unknown:
You mentioned that the overarching goal is that as you're loading data into FeatureBase, you don't have to think about how to actually convert your existing representations into what FeatureBase actually wants. And so I'm curious if you can talk to some of the ingest path and some of the surrounding tooling that you're building to be able to make that a more seamless experience, so that people can just throw FeatureBase at the problem and get the performance and analytical capabilities that they want without having to do a bunch of planning and extra engineering effort.
[00:31:32] Unknown:
Yes. Yeah. Absolutely. So if you're familiar with, you know, Pilosa from back in the day, if any of your listeners ever used that, you basically had to set up a separate process that was running, like, in a separate binary. And it was gonna pull either from CSV files or from a SQL database or from Kafka. And it would do a lot of the heavy lifting of data transformation before sending that data in sort of a bitmapified form up to the FeatureBase cluster. We still have that pattern under the hood because it helps offload a lot of load from the main cluster, which is really nice, to sort of be able to scale your ingest workload and your query workload independently. However, with the SQL engine that we've been building, we are exposing a lot of that same functionality through SQL. And in the cloud product, this will sort of seamlessly just say, like, you know, if you're doing an insert statement, the query planner will know that a lot of this data can sort of be shoved off into your ingest workers that are running all the data transformation components, and your queries can be passed through to your compute nodes that have all the data and are doing query processing.
But even in just a standard, you know, FeatureBase deployment, the ability to just insert some data, you know, at the command line, you know, just to see how it works. I mean, it can't be overstated how nice that is just for getting started and figuring things out, rather than having to mess with a separate tool and figure out, oh, I need to use, like, the CSV version of this tool to ingest these CSV files and I have to get the headers just right and, you know, get all the mappings. Now it just looks like a SQL insert statement you're used to. Right? You set up your schema, you say insert into, here's all the data, and you can even do some inline transformations. It's something we've got coming up, which I think is gonna be really, really nice for people.
You get this kind of, you know, to excuse the cliche, a single pane of glass. Right? Everything goes through SQL, and it sort of just figures out under the hood what it needs to do with that. And because SQL is, you know, a language and we have control of the whole parser and can sort of add whatever we want to it, we can add arbitrary capabilities to ingest right in there. You know, data transformations, mappings, you know, casting things, taking 2 fields and combining them into 1, you know, whatever you really wanna do. Having that all go through that one interface, and then for us to be able to, on the back end, like, optimize it as we see fit, split things out, you know, move things where they need to go, I think it's gonna be really, really nice.
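As a rough illustration of what that single SQL path can look like, here is a hedged sketch of setting up a table with a multi-valued column and inserting into it from Python. The endpoint path, port, type names, and set-literal syntax are assumptions made for the sketch, not FeatureBase's documented dialect; the actual table/type syntax and SQL-over-HTTP details should be checked against the docs.

```python
import requests

# Hypothetical SQL-over-HTTP endpoint; real auth headers and path may differ.
FEATUREBASE_SQL_URL = "http://localhost:10101/sql"

def run_sql(statement: str):
    """Send one SQL statement as the request body and return the JSON response."""
    resp = requests.post(FEATUREBASE_SQL_URL, data=statement,
                         headers={"Content-Type": "text/plain"})
    resp.raise_for_status()
    return resp.json()

# A multi-valued "hobbies" column modeled as a set field rather than a join table.
# The type name and literal syntax below are illustrative assumptions.
run_sql("CREATE TABLE people (_id ID, age INT, hobbies STRINGSET);")
run_sql("INSERT INTO people (_id, age, hobbies) VALUES "
        "(1, 34, ['skiing', 'lego']), (2, 28, ['climbing']);")
```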
[00:34:08] Unknown:
The mention of SQL also brings up the end user experience of working with FeatureBase. And I know that there is a PQL language that has been supported and that the SQL interface is a newer addition. I'm curious if you can talk to some of the kind of user experience, I don't know if design is the right word, but some of the efforts that you've been putting into the user experience aspect of working with FeatureBase to make it more approachable without having to do a bunch of custom training or, you know, self directed learning to try and figure out, how do I actually, you know, take this thing that seems really interesting and make it fit into the box that I understand.
[00:34:48] Unknown:
Yeah. You know, SQL is just a huge part of that, because even if you don't know SQL, like, you're familiar with it. Right? Like, it just comes up everywhere. And so having those common abstractions, we can just say, like, hey, you run this client, tell it where FeatureBase is, and it's gonna give you a command prompt that you can type SQL into, and you can basically manage all your interactions from there. Or, you know, if you're interacting programmatically, you're just making HTTP requests and sending SQL strings.
Like, anybody can do that in any programming language. It just greatly, greatly simplifies things from, you know, what we had in the very early days. PQL, which at that time stood for Pilosa Query Language, was a language that basically mapped directly to what was available in the storage engine. Right? It was like, take this bitmap and intersect it with this bitmap and count the number of set bits in the output. And of course, you know, we've known since the eighties or before that having the language in which you talk about your data be sort of abstract and separate from the data representation is a really powerful and important property.
So it was clear early on that we were going to have to move away from PQL, which is basically the assembly language of the database. Right? It sort of maps directly to what's available at the storage layer. And move up into a declarative, sort of more abstracted language that allows you to just talk about your data agnostic of the underlying representation. So SQL was the obvious choice to do that, because it's already out there. Everybody knows it. It has its warts. No question. But we don't feel the need to try to define an alternative query language, you know, that solves all the problems, you know, along with everything else that we're doing. SQL ultimately will get the job done, and with the added benefit that people already know how to use it and, you know, where to expect things to be kinda weird. Like, oh, like, how do you handle nulls and, you know, that kind of thing. So I think it really is all about making that one consistent interface and using that everywhere, because in the past we've had, like, oh, we expose the gRPC interface, and we actually mimic the Postgres wire protocol so you can use a Postgres client to talk to FeatureBase and send PQL or a subset of SQL over that. So we had all these, like, really cool projects where we explored a lot of different things.
But ultimately, if you wanna build a cloud product and you want it to kinda work everywhere, you need to, like, you know, there's middleware, there's proxies, there's all kinds of things out there on the Internet. And HTTP is kind of universally supported. The authentication mechanisms are well known and well understood. You just use HTTP and you send SQL over it. And that is sort of a universal interface that everybody can kinda grasp on to and understand. And then we can focus on explaining the parts that actually need to be explained, like the set fields and the time fields and, you know, the things that are sort of fundamentally different at the storage level that you wanna be able to take advantage of.
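In the same hedged spirit as the ingest sketch above, the query side is just another SQL string over HTTP. The endpoint, port, and the exact SQL surface (especially grouping on a set column) are assumptions for illustration rather than documented behavior.

```python
import requests

# Hypothetical: POST a SQL query string to the same assumed endpoint as before.
resp = requests.post(
    "http://localhost:10101/sql",
    data="SELECT hobbies, COUNT(*) AS n FROM people "
         "WHERE age > 30 GROUP BY hobbies ORDER BY n DESC LIMIT 10;",
    headers={"Content-Type": "text/plain"},
)
resp.raise_for_status()
print(resp.json())
```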
[00:37:54] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. As far as the kind of security elements of it, given the fact that you are building a cloud product and that you're also investing in a serverless approach, I'm particularly interested in understanding some of the ways that the underlying storage and the ways that you've architected the engine are reflected in the security model, and how you think about multi tenancy and scalability of the engine to be able to support that cloud product approach?
[00:38:53] Unknown:
Yeah. Oh, man. There's so much to unpack here. So all the basics that you'd expect, like, you know, you sign in, you get a token, and everything's encrypted. All the network communications are encrypted and so forth. Where I think it starts to get really interesting is what sort of level of granularity we can expose in terms of permissions and controls. Because there have been some databases that came out, you know, in the last decade or so, especially in the intelligence community, where, you know, one of their big selling points is, like, cell-level access control. Right? Like, you can grant individuals access to particular columns of particular records at a really, really granular level. That's kind of a cool, useful feature, but it also kind of has a high performance cost potentially.
I think with the way things are represented in FeatureBase, we actually have an opportunity to do this in a really efficient way. Now, we haven't done a lot of this yet, but you can think about, okay, so if I wanna give someone access to a subset of the records in a table, that's just a bitmap filter that I'm gonna apply to all their queries on that table. That's obnoxiously fast. Right? There's almost no overhead to do that. So I just have to store that bitmap for that person in probably, like, a hidden field on that table. And, you know, it says, like, this person has access to these records, and that gets basically added to all their queries by the query planner.
If you wanna give someone access to just certain fields, that's pretty easy too. At a high level, you store that metadata about what fields they have access to. And because, you know, like a columnar database, everything is sort of broken out. You know, if they want to do a big, like, backup or a dump of data, it's pretty easy to just read the columns that they're interested in. You don't have to go through and, like, filter out at a granular level what columns you can export for this person. You just go export the columns in question. It's not like you have to scan a table and remove them.
So that's really nice. One thing, though: other databases could do everything I've talked about so far. I think the one thing that FeatureBase could do uniquely is that we store the bitmap data separate from the key data. And by the key data, I mean, what string does each bitmap map to, or what record does each position in the bitmap map to? So you can imagine over here, we've got this big compressed bit matrix. It's just a bunch of, you know, zeros and ones. And then over here, we say, here's what each row in that matrix maps to, you know, skiing and Lego building and, you know, whatever. And over here, we store here's what each column maps to. So let's say you wanted to run some clustering algorithm across your whole dataset. You know, you don't wanna do some really intensive machine learning. You wanna pick out clusters in your data. It's quadratic or maybe even worse than that. So you don't wanna do it locally where you're running. You wanna ship this up to, you know, the cloud where you have elastic compute resources available to you.
You can just, in theory, just send up the bitmap data without sending any of the keys that go along with it. So you're exposing far less information, you know, potentially, if there's some breach in the cloud environment or whatever. You're only exposing this big old matrix of bits that no one has any idea, you know, notionally, about what those actually map to, rather than just exposing the whole dataset. Not to mention, you know, you use a lot less bandwidth, because you're only sending these compressed bitmaps rather than the sort of key value translation stores. I think that's an area where, again, you know, our problem is always focus. Right? We have to decide what to focus on and where to spend our resources, because it turns out building a database is a lot of work. But I think that is an area where we could have some really cool, like, product opportunities.
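A minimal sketch of the record-level permission idea described above, with plain Python sets in place of compressed bitmaps; this is not a description of FeatureBase's actual access-control implementation, just the shape of the trick: the grant is itself a bitmap that gets ANDed into every query like any other filter.

```python
# Toy example: a per-user "visible records" bitmap is intersected into every
# query, so row-level security costs one extra AND per query.
value_bitmaps = {
    ("region", "EMEA"): {1, 2, 3, 8},
    ("region", "AMER"): {4, 5, 6, 7},
    ("tier", "gold"):   {2, 4, 7, 8},
}
grants = {
    "alice": {1, 2, 3, 8},              # Alice may only see EMEA records
    "bob":   {1, 2, 3, 4, 5, 6, 7, 8},  # Bob may see everything
}

def count_where(user, field, value):
    # The permission bitmap is applied exactly like any other filter bitmap.
    return len(value_bitmaps[(field, value)] & grants[user])

print(count_where("alice", "tier", "gold"))   # 2 (records 2 and 8)
print(count_where("bob", "tier", "gold"))     # 4
```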
[00:43:00] Unknown:
Another aspect of what you're building that happens no matter what type of data you're working with is the question of schema evolution, where with relational engines, you can add a column or alter a column or create a new table. But because of the fact that FeatureBase represents every kind of attribute as its own discrete kind of unit of data and the associated bitmaps, I'm curious how that plays into that question of schema evolution and the ways that the data changes as you progress through kind of evolutions of the products and the information that you're trying to represent and work with?
[00:43:37] Unknown:
Adding and removing columns is actually really easy because, I mean, just like a columnar database, all the columns are stored separately. So it's really no problem to add or remove a column. Now, if you want to modify, you know, a column, like you wanna represent your integer differently or something, a lot of that stuff is notionally possible and will have no more overhead than it does in your typical relational database. But we haven't implemented a lot of it. So there's a few, like, alter column type things that are supported. We recently added the ability to add a TTL to a time column so you can sort of age off old data automatically.
You can turn that on and off or tweak it. But for the most part, the way that our customers are evolving their schemas is through adding and removing whole columns right now, which has actually worked out okay. It hasn't been sort of a major sticking point, but I'm sure that we'll get more demand for being able to alter columns and, you know, sort of evolve the schema in arbitrary ways in the future. I don't think there's any, like, fundamental reason why, you know, a different database would be better or worse at a lot of these things. A lot of these are just, they're really, like, heavy data transformation operations that you just have to sort of support, and you try to support them in a way where the database stays live, you know, while it's happening. So you're not, like, overwriting the data in place. You know, you make a copy, you make some changes, and then you flip over when it's done. I think all of these things are doable. It's just a matter of implementation time and, you know, deciding that it's worth it to put our resources there and actually expose those capabilities.
[00:45:26] Unknown:
As far as your experience of working on FeatureBase and evolving the project and the product and keeping an eye on the broader data ecosystem and how your database engine is being used, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:45:42] Unknown:
You know, I think one that came up a little while ago was some folks doing kind of a security use case, where they were taking lots of different application binaries and breaking them up in various ways and hashing parts of them and, you know, like, you know, breaking out different parts of the binary and then sort of storing all these different hashes and things as features in FeatureBase. Features just sort of being distinct values of a column, but you can see where some of the naming comes from there. And then using that to sort of very quickly and at scale detect whether some new binary or some new artifact in somebody's system might have malware associated with it and sort of assign it a score. If that explanation sounds kind of, like, a little fuzzy and hand wavy, it's because I don't actually fully understand what they're doing. And that's actually one of the most exciting parts about it to me, is that you're getting value out of the software we developed in a way that I don't fully understand. Like, that's really cool. I think, you know, that speaks to, you know, this thing having broader applicability than sort of what it was born from and where it came from, which is really exciting to me.
[00:46:58] Unknown:
In your work of helping to build and evolve the engine and starting to build out this cloud product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I knew that
[00:47:10] Unknown:
building a database was a huge amount of work. I mean, you see it quoted, like, you know, databases take a hundred engineer-years to mature, and you see that kind of thing thrown around. What I didn't realize is that I think a lot of those engineer-years are spent working on CI and, you know, testing, and getting your tests to run reliably, and testing at scale, and making sure that you clean up your scale tests after they're done so you're not spending a fortune in the cloud, and getting your tests to run reliably and not, you know, be sort of flaky and fail. I feel like there's an old quote from someone where, you know, it's like one of the first bugs that was found, and someone realized that they were going to spend a significant fraction of the rest of their life searching for bugs in their own programs. You know, this is back in, like, the forties or something. And I was like, I think I just realized recently I'm going to spend a significant fraction of my life, like, dealing with CI. Right? Because you have to have CI, you have to have really, really good test coverage.
Anything that's not tested in CI, you can assume it doesn't work, basically. That's kind of my motto. Like, if it's not in CI, it doesn't exist, because things will break arbitrarily. You know, when you've got a dozen developers or more working on a product, changing things, adding features, you can't watch everything all the time. If you want to make sure that something is going to work and nobody's going to break it, you better have a test for it, and it better run in CI. And then you better go and, like, look at CI from time to time and make sure that it's not failing silently, because that'll happen too. So maybe not the most exciting answer, but, yeah, it's just really, really important and really, really hard and time consuming to get right. And I think that's where a lot of
[00:48:57] Unknown:
the efforts in building a system like this need to be focused. Absolutely. Yeah. Particularly for a storage engine where people are trusting it to be one of the most critical elements of their infrastructure. Because if your web server goes down, well, you just spin up a new one. But if your database goes down, if you don't have backups, then you're out of luck, and you just threw away, you know, however much time your company spent collecting all that information.
[00:49:23] Unknown:
You gotta have backups. You gotta test your backups. You gotta test your restore routines. And databases, I think, are particularly in need of testing, and they're particularly difficult to test, because you need to have giant, like, fixtures of data to be able to shove into them, and then you need to, like, know what the right answers are for queries, and you have to have those stored. And, yeah, there's a lot to do. Absolutely.
[00:49:49] Unknown:
And for people who are interested in trying to speed up their analytics and be able to query across larger datasets, what are the cases where FeatureBase is the wrong choice?
[00:50:02] Unknown:
I like the way you asked that question, because it sort of precludes someone using it for a transactional workload, which you definitely should not do. In demos, I used to run a select star limit 1 type query, and then a count with a bunch of different filters. The select star takes something like 75 milliseconds to run, while the count across the whole dataset takes around 3 milliseconds. If your primary use case is that you need to reconstruct whole records and get individual records back, you're not gonna have a good time with FeatureBase. If your use case does a lot of filtering and aggregates, you need four things. You need low latency queries, either because you're an impatient person or because you're powering a web page that you want to be very responsive.
So you want low latency queries. You want fresh data, so you're not querying data that's a day old from an overnight batch routine; you want it live ingested. That's something we've spent a ton of engineering resources on in FeatureBase. We micro batch incoming data and apply it live, so within about a second of data being generated, it's available for query in the database. So you want low latency queries and you want fresh data. Potentially, you also need high concurrency queries, because you've got lots of users hitting the system.
Typical analytics use cases are business intelligence type things, where you might have one analyst poking around at a SQL prompt. This is more like you're powering an application with it and exposing it to a broader audience, so you have more query concurrency coming into it. So on the query side: low latency and high concurrency. On the ingest side: very fresh data, which is latency on the ingest side, your data being available as it comes in, and high throughput, so maybe you're generating 500,000 records per second and you want those to be freshly available for analysis.
If you need all of those things, I think you definitely need FeatureBase. If you need two or three out of the four, you might wanna look at FeatureBase. If you don't need those things, or if you're doing transactional workloads, or if you don't have billions and billions of records right today, FeatureBase is probably more trouble than it's worth in terms of the operational overhead and the maturity of the tool. Come back in a year or two, and I think it's gonna be just as easy to use as most other databases, and then maybe it just makes sense anyway. But, yeah, I think it's about those four things: the latency and throughput on the ingest side, and the latency and concurrency on the query side.
If you're doing analytics and you need to hit all four of those areas, I think you should take a very close look at FeatureBase.
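To make that contrast concrete, here is an illustrative sketch of the two query shapes Jaffee describes from his demos. The table and column names are hypothetical and the syntax is generic SQL, not a verbatim FeatureBase example:

```sql
-- Point lookup / whole-record retrieval: the shape FeatureBase is not optimized for,
-- since reconstructing a full row means touching the bitmaps behind every column.
SELECT * FROM events LIMIT 1;

-- Filtered aggregate: the shape FeatureBase is built for, since the predicates and
-- the count reduce to fast bitmap intersections across the whole dataset.
SELECT COUNT(*)
FROM events
WHERE country = 'US'
  AND device_type = 'mobile'
  AND plan = 'enterprise';
```

The point is the shape of the workload rather than the exact syntax: heavy filtering and aggregation over billions of rows plays to the bitmap representation, while record-at-a-time retrieval does not.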
[00:53:08] Unknown:
As you continue to build out the open source project and the cloud offering, what are some of the things you have planned for the near to medium term, either for the core engine or just for the ecosystem around it to improve the user experience?
[00:53:23] Unknown:
In terms of the user experience, it's basically all about SQL: getting more SQL support, and getting things optimized so that functionality isn't just technically available in SQL but actually works at a reasonable speed. That's really the main thing on the user experience side. The other main thing on the roadmap is the serverless work, and that will definitely affect the user experience, because you can stop thinking about your FeatureBase deployment as this cluster of nodes and just start thinking of it as: I've got some databases in here, and if I add more data, I don't have to think about anything. On the back end, we're running it in Kubernetes or ECS or whatever, adding new containers and rebalancing data as needed.
And our hope is that we don't expose too much of that. I mean, we'll expose whatever we need to. So if somebody wants to say, hey, I want some dedicated resources and I want to scale them up because I know I've got a big event happening, I think we probably will end up exposing some amount of that. But we'd love for it to be a pretty seamless experience where you think of it like S3. You don't think about scaling up your S3 deployment; you just shove a bunch of stuff in there, don't worry about it, and it'll be there when you need to get it out.
[00:54:44] Unknown:
Are there any other aspects of the work that you're doing on FeatureBase and the database engine and the products you're building around it that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:55] Unknown:
We could talk about set fields and time fields, which I think are the most important from a usage perspective. But I think the thing I'm most excited about right now is definitely the serverless work, and that's gonna be launching to the public probably early next year, where you can actually get onto featurebase.com and spin up a serverless deployment. I think that's just gonna be really cool and really transformative, where you can get on there, bulk insert a bunch of data, and not even think about the resources behind it. I can't tell you how many conversations I've had about how do I size my cluster, because when you can't easily scale it up and down, that really matters: what types of nodes do I pick, how much memory versus CPU? For all of that to just go away, and for it to dynamically rebalance in the background, is gonna be so huge. And there are a lot of interesting architecture and computer science problems under the hood that we get to work on. Other than that, I think we covered a lot of really good ground here. Absolutely.
[00:55:55] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or even contribute to the engine given that it's open source, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:14] Unknown:
I wanna give a bit of a meta answer to this question. I think there are a lot of gaps and a lot of things that need to improve, but the thing that will improve all of them the most is better programming languages. Everything kind of stems from that. The iteration speed for developers to improve things depends a lot on the programming language that you choose. We use Go. I love Go. It has plenty of warts, and I'm not convinced that any of the modern languages in use today have solved all the problems, which is a silly thing to say; nobody's going to solve all the problems for a while. But I think there are still a ton of benefits to be had from improving programming languages and development environments, specifically with a focus on iteration speed. The more you can decrease your cycle time for testing changes and tweaking things and playing with things, the better off the world will be in terms of the speed at which we can create new things and improve all other aspects. So that's just kinda where my head's been at lately.
Sorry to take that question a little bit off the rails, but I think programming languages are vitally important, and they really only get worked on if they're sponsored by a huge company, because it's hard to see how you would make money building a programming language as a business. That's something to look at and think about for the future: how those very fundamental tools are going to evolve, because I think they can have a huge impact. Absolutely.
[00:57:49] Unknown:
Well, for one, there are no rails to this question; that's the point of it. But I definitely appreciate that perspective, because as you said, the programming language, as with all language, really shapes the ways that you're able to think about given problems. And for the problems that we're starting to address, we need to come up with new ways to think about them, particularly as we move into new areas of architecture and infrastructure, and into problems that haven't been solved yet, or haven't been solved well. So I definitely appreciate that perspective. Thank you again for taking the time today to join me and share the work that you and your team are doing at FeatureBase. It's definitely a very interesting project, and I'm excited to see some of the directions that you take it. I hope you enjoy the rest of your day. Hey, you too. Thank you so much for having me, Tobias. This has been a lot of fun.
[00:58:44] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Chapters
- Introduction and Sponsor Message
- Interview with Matt Jaffee: Introduction and Background
- Overview of FeatureBase and Its Use Cases
- Evolution of FeatureBase and Cloud Product
- Impact of AI and Machine Learning on Data Management
- Comparison with Vector Databases
- Data Modeling and Schema Evolution in FeatureBase
- Security and Multi-Tenancy in FeatureBase
- Interesting Use Cases and Lessons Learned
- When Not to Use FeatureBase
- Future Plans for FeatureBase
- Closing Remarks and Final Thoughts