Summary
The most expensive part of working with massive data sets is the work of retrieving and processing the files that contain the raw information. FeatureBase (formerly Pilosa) avoids that overhead by converting the data into bitmaps. In this episode Matt Jaffee explains how to model your data as bitmaps and the benefits that this representation provides for fast aggregate computation. He also discusses the improvements that have been incorporated into FeatureBase to simplify integration with the rest of your data stack, and the SQL interface that was added to make working with the product easier.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I’m interviewing Matt Jaffee about FeatureBase (formerly known as Pilosa and Molecula), a real-time analytical database engine built on bitmaps
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what FeatureBase is?
- What are the use cases that it is designed and optimized for?
- What are some applications or analyses that are uniquely suited to FeatureBase’s capabilities?
- What are the notable changes/evolutions that it has gone through in recent years?
- What are the forces in the broader data ecosystem that have had the greatest impact on your project/product focus?
- What are the data modeling concepts that platform and data engineers need to consider when working with FeatureBase?
- With bitmaps as the core data structure, what is involved in translating existing data into bitmaps?
- How does schema evolution translate to the data representation used in FeatureBase?
- How does the data model influence considerations around security policies and governance?
- What are the most interesting, innovative, or unexpected ways that you have seen FeatureBase used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on FeatureBase?
- When is FeatureBase the wrong choice?
- What do you have planned for the future of FeatureBase?
Contact Info
- jaffee on GitHub
- @mattjaffee on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Your host is Tobias Macey. And today, I'm interviewing Matt Jaffee about FeatureBase, formerly known as Pilosa and Molecula, a real-time analytical database engine built on bitmaps. So, Matt, can you start by introducing yourself?
[00:01:56] Unknown:
Matt Jaffee. I've been working at FeatureBase since it was Molecula, and before that when it was Pilosa, and before that at the company we spun out of, which was called Umbel, back in 2015. Prior to that, I was a government contractor for about a year and a half after I got out of graduate school, where I studied computer science with a focus on distributed systems and networking.
[00:02:20] Unknown:
And do you remember how you first got started working in data?
[00:02:24] Unknown:
Yeah. So I guess if you're in computer science and you're in software these days, you're gonna crash into data pretty quickly. I was in grad school. I was thinking about getting a PhD. I really just had an itch to write a lot of software and was kind of frustrated just writing papers. So I ended up taking a job at a government contractor, where I got to do a wide variety of stuff, actually, everything from building GUI programs in Python to inspecting network traffic on a GPU and some stuff in between. But, ultimately, I wanted to live someplace different and wanted to kinda have a different workplace culture, and that's when I found Umbel in Austin, Texas.
So I joined Umbel, which was kind of a SaaS platform for marketing teams to understand their audience better, you know, be able to collect data and figure out who their audience is and how to market to them. And that's kinda all I thought it was when I joined, and I think it was about 2 weeks in, after being there, that I was, you know, talking to the chief architect, and he was telling me about this, essentially, this distributed database that they built internally to handle some of the more difficult queries. And I mean, this thing was wild. Like, it was written in Go, which was very new at the time. You know, it was a distributed system, and it was storing data in this really strange way where everything was encoded as a bitmap.
And so I, you know, I was just taken with this. I was like, I want to work on this thing. Like, when can I start? And about a year and a half later, we had the opportunity to spin out a company just around that piece of the infrastructure. So that was Pilosa. That was really exciting. And basically, ever since then, I've, you know, just been developing a database. So learning a lot about databases and the data engineering that, you know, goes into getting data to end up in the database and, you know, how to make the database behave the way you need it to for what you wanna do. You know, it's all connected.
[00:04:25] Unknown:
And so in terms of the FeatureBase project and product itself, I've done interviews previously about the Pilosa database engine before it became FeatureBase and about Molecula, the business. But for people who haven't listened to those, I'm wondering if you can just give a quick overview about what you're building at FeatureBase and some of the use cases that it is designed and optimized for.
[00:04:50] Unknown:
We've had a long journey. So basically, the core tech has never really changed. It's a database that is built on a pretty unique storage format, in that it uses bitmap indexes as the primary data representation, you know, on disk and in memory. And everything is sort of built around that. You know, the query optimizer, the query planner, the execution engine is all built with bitmaps in mind to take maximum advantage of them. And there's a lot of sort of understood, you know, truisms about bitmaps and bitmap indexes, if you're into databases, that don't actually hold with some of the modern approaches to bitmap compression and some other techniques that have, you know, come around even in just the last decade or so.
We've really found that the use cases for these things have exploded, so much so that we've gone from a very specialized tool that was built to handle a certain part of the query workload in a certain product, to, we just have a relational database, you know, built on top of bitmaps. And we're constantly expanding functionality around that, where you can do all kinds of SQL queries and basically do whatever kind of query you need to do, but very much with a focus on analytics. So there's so many dimensions that you can look at databases on, but I think the main spectrum is just from transactional workloads to analytical workloads, and FeatureBase is dialed all the way to the analytical side. I think because it's built entirely on bitmaps, the trade-offs are more in favor of analytics than any other database out there, which does mean that the transactional workloads suffer. You know, we're not getting something for nothing, but the analytical workloads can be incredibly efficient.
[00:06:50] Unknown:
As far as the capabilities of FeatureBase, because of the fact that it is operating on these bitmap indexes and has these efficiency gains as a result, I'm curious if there are any types of analysis that it uniquely enables, or any ability to eliminate components of architecture that would otherwise be necessary in a more, quote unquote, traditional database engine.
[00:07:23] Unknown:
When you look at analytical workloads generally, there are definitely some that FeatureBase is better at than others. And, well, let me start by telling you kind of the original use case that Pilosa was developed for, you know, way back when, and then we can kind of expand from there. So back at Umbel, we had this problem where one of our main queries that people wanted to do was, show me the top interests among people in my audience. So maybe it's the most liked Facebook pages among people in my audience. That's a sorted group by, essentially. That by itself is okay.
But if you wanna do that same sort of group by on any subsegment of your audience, kind of in real time, where you're saying, I just wanna know the top Facebook likes among, you know, females from Massachusetts who like the NBA, or any complicated where clause you can think of, and then get that top list back. If you've got, you know, hundreds of millions of consumers potentially and, you know, the universe of Facebook likes is in the tens of millions or more, that becomes a very challenging query. And we were using Elasticsearch at the time, which, you know, back in 2013, 2014, Elasticsearch was not what it is today, and our 20 node cluster was just falling over, or, you know, the garbage collector would run and the queries would take 20 seconds or whatever.
So that was the original use case for Pilosa at the time: be able to do these sorted group bys on very specific segments of data in real time, because the whole thing had to power a web app, you know, that people were just clicking around in. And some of those page loads in that web app, like, you would define a segment in the app and then it would do these top N queries on a whole bunch of different columns in the data all at once. So every page load might be doing, you know, 20 of these queries. And so it just had to be way more efficient on the back end for us to have a business. Right? That's where it was kinda born from.
You can kinda start imagining from that type of segmentation workload. It doesn't have to be marketing. It doesn't have to be advertising. There's lots of things that wanna do this. Anything from, you know, like, high energy physics to, you know, intrusion detection systems wanna do things like this. That's really the sweet spot: where you have complicated where clauses and you wanna do sorted group bys, or really anything that a columnar database would be good at. And then in particular, anything that wants to look at specific values within a column, because that's where FeatureBase really shines. Instead of having to scan an entire column, because everything is split out into bitmaps, you can actually just scan the data that's about a particular distinct value without having to scan the rest of the column. It's really like you get the benefits of a columnar database plus the benefit of being able to sort of address things down to distinct values.
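To make the shape of that workload concrete, here is a minimal sketch of how a segmentation filter plus a top-N group by can be answered with nothing but bitmap intersections. It is not FeatureBase's actual implementation; plain Python sets stand in for compressed bitmaps, and the field names and record IDs are made up:

```python
# Toy bitmap index: every distinct value of every field maps to the set of
# record IDs (bit positions) that have that value. Real engines use compressed
# bitmaps (e.g. Roaring), but Python sets show the same query logic.
index = {
    ("state", "MA"): {1, 2, 5, 9},
    ("gender", "F"): {2, 3, 5, 7, 9},
    ("likes", "NBA"): {2, 5, 8, 9},
    ("likes", "Lego"): {1, 2, 5},
    ("likes", "Skiing"): {3, 5, 9},
}

def top_n(filters, group_field, n=10):
    """Intersect the filter bitmaps, then rank values of group_field by overlap."""
    segment = set.intersection(*(index[f] for f in filters))
    counts = {
        value: len(bits & segment)
        for (field, value), bits in index.items()
        if field == group_field
    }
    return sorted(counts.items(), key=lambda kv: -kv[1])[:n]

# "Top likes among females from Massachusetts who like the NBA"
print(top_n([("state", "MA"), ("gender", "F"), ("likes", "NBA")], "likes"))
```

The point of the sketch is that the arbitrary where clause never touches raw rows: it is just a chain of intersections over pre-built bitmaps, and the group by is a population count per value.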
[00:10:21] Unknown:
Because of that combination of benefits, I'm curious if there are any approaches that have been built up around the capabilities of FeatureBase in terms of rethinking the way that you structure the data that you're storing, or ways that you think about the query patterns or the ways that you structure your analysis because of those efficiency gains. Whereas you may, you know, optimize in a different direction if you're going for a columnar store or if you're, you know, trying to optimize for a traditional OLAP engine.
[00:10:55] Unknown:
We're really trying to put a SQL relational interface on top of this thing so you don't have to think about it too much. That said, I think the biggest departure from your typical relational database or columnar store are set fields. And I'll explain what I mean by that. Because of the way a bitmap index works, where, you know, one hobby might be skiing, that's a unique value in your column. For each unique value, you have a bitmap, and each set bit represents a particular person that is into skiing. That's their hobby. But because each distinct value is represented separately as a separate bitmap, it's very, very easy and natural to have a set of values associated with a particular person, because everyone has a certain position in the bitmap that's associated with them, a certain index into the bitmap.
And so every single one of these bitmaps, you know, for skiing or Lego building or whatever your hobby is, you have a position in that bitmap. And if that position is set to 1, you are associated with that hobby. So there's no need to, like, have a special field type to represent where, you know, one person has multiple hobbies. There's no need to have a many to many relationship or a join table or whatever. You just set the bit in the bitmap of the hobby the person is associated with. So that is very, very powerful and much, much more efficient than trying to do a traditional, like, many to many type relation, or having a special field that can store multiple values and having to have, like, you know, a fixed size memory slot available or be resizable or whatever.
You get this very natural extension to these set types that allows you to very, very efficiently represent things like Facebook likes, or, you know, anytime you have, what domains has a certain IP address accessed, or what behaviors have you observed from a particular particle, you know, with the data coming out of your particle accelerator. All kinds of things like this get represented a lot more efficiently. Whether you expose it that way through the relational model, through the API, like, you don't necessarily need to, but under the hood, you can represent it that way. You get a massive amount of compression and a massive reduction in computational costs when you're computing where clauses and aggregates and that sort of thing.
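A rough illustration of the set-field idea described above, again with plain Python sets standing in for bitmaps rather than FeatureBase's on-disk format: giving one person several hobbies just means setting their position in several value bitmaps, with no join table in sight.

```python
from collections import defaultdict

# One bitmap (here, a set of record positions) per distinct value.
hobby_bitmaps = defaultdict(set)

def add_hobbies(person_id, hobbies):
    # A "set field": each hobby simply sets this person's bit in that hobby's bitmap.
    for hobby in hobbies:
        hobby_bitmaps[hobby].add(person_id)

add_hobbies(0, ["skiing", "lego"])
add_hobbies(1, ["skiing"])
add_hobbies(2, ["lego", "climbing"])

# Who has both hobbies? One bitmap intersection, no many-to-many join.
print(hobby_bitmaps["skiing"] & hobby_bitmaps["lego"])   # {0}
# How many people ski? One population count.
print(len(hobby_bitmaps["skiing"]))                       # 2
```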
[00:13:32] Unknown:
From hearing you talk about things like Facebook likes and particles, I'm also curious about some of the data modeling approaches that it can support, where with columnar data, it's generally trying to do aggregate analysis on these various attributes. Row oriented is generally optimized for relational. But from what you're discussing, it also seems like you could potentially even start to branch out into the graph domain and being able to do some network queries of the ways that some of these different attributes are connected to different entities.
[00:14:05] Unknown:
Yeah. That's a really interesting point. Early on, we thought a lot about graph use cases, because when you think about it, you've got a bitmap. You know, it's a sequence of bits. And really, for each value, you have a different bitmap. You can think of it as a bunch of bitmaps stacked on top of each other. Alright. Well, that's a bit matrix. What's one way to represent graphs? An adjacency matrix. So you have this very efficient and scalable way to represent a graph of almost arbitrary size and to compute on it, because everything is stored as compressed bitmaps.
You know, if you want to compare, you know, the connections of one thing to the connections of other things, you can do that very efficiently. So I do think there's a lot of interesting possibility there. And I will say that one of our perennial problems has been focus, and I don't think we can necessarily go heavily down the graph path right now, when we're kind of focused on, you know, just being a reasonable, like, looks like a relational database, really good at analytics, and stores things as bitmaps under the hood. And I think if we nail that, and we nail the, like, cloud version and the serverless component, which I hope we get time to talk about later, people are gonna be able to build all kinds of crazy things on top of this that do things that we hadn't even considered.
So we've thought about the graph thing a lot. We're not focused on that path right now, but I do think the underlying representation has a lot of promise in that department.
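To illustrate the adjacency-matrix analogy Matt sketches here (a hypothetical sketch only; FeatureBase does not ship a graph API), one bitmap of neighbor IDs per node is exactly a row of an adjacency matrix, and comparing two nodes' connections is just set algebra:

```python
# One "neighbors" bitmap per node = one row of the adjacency matrix.
edges = {
    0: {1, 2, 3},   # node 0 points at nodes 1, 2, 3
    1: {2, 3, 4},
    2: {3},
}

# Common neighbors of nodes 0 and 1: a single bitmap intersection.
print(edges[0] & edges[1])                                   # {2, 3}

# Similarity of two nodes' connection sets (Jaccard), still just set ops.
print(len(edges[0] & edges[1]) / len(edges[0] | edges[1]))   # 0.5
```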
[00:15:37] Unknown:
Keying off of what you were just saying about the cloud service and the serverless approach, I'm wondering if you can give an overview about some of the notable changes and some of the evolutions that the FeatureBase project has gone through in recent years, I guess specifically in the past 3 years since I first talked about the engine and the 2 years since I talked about the product side of it?
[00:16:00] Unknown:
Yeah. So when we originally launched, it was open source, and we developed in open source and built up something of a small community. And that was really cool, but it was hard to get paying customers. And after a while, investors get, you know, a little nervous when you don't have, you know, lots and lots of paying customers. We kind of decided, like, alright, we need to pull back focus from open source, because that actually takes a lot of time and resources to do. Right? And, you know, you can't make as many breaking changes and you have to manage things very carefully. So we decided, you know, if we go closed source and deliver this as, like, on-prem software, we can move a lot faster for a while. And that's what we did. And we were actually fairly successful in doing that, and we got some pretty large enterprise customers. And then, you know, come, I don't know, 6 months to a year ago, we were like, okay, I think, you know, it's time for 2 things, basically. First of all, we need to have a cloud product, and we had kind of been planning that for a while. But, you know, there's no way that, as a sort of modern data business, you can not be in the cloud. Anything on-prem is declining, and, you know, in the usage of cloud products we see, you know, Snowflake obviously took off like crazy in the past few years.
That's gonna be the way of the future. We knew we needed to have a good cloud offering, and we knew that it's pretty hard to get people to trust a database that isn't open source. And it's really healthy for a database to be open source and have that community, you know, of people using it and finding problems, helping fix them, finding new use cases. So we knew we wanted to focus on those 2 things. And so we hired some expertise, you know, folks that had built cloud products before, especially, you know, cloud data products, to help us with that. And we hired some folks to help us, you know, build a community and manage the open source side of things.
And we re-open sourced FeatureBase with everything that we developed in the past few years, which was actually quite a lot. We totally rewrote the storage engine and built it off of B-trees to basically be a lot more scalable, fix a lot of the issues we'd found in the original storage engine, and have transactional guarantees at the shard level, really good ACID guarantees. So a lot went into that. Now the focus is cloud. And now that we have our sort of base cloud product in place, which is basically just: we run a FeatureBase cluster for you in the cloud. Now we're focused on the next step, which is serverless.
[00:18:47] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations.
Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. Before we dig too much into the serverless aspect, as far as the broader data ecosystem, that has also been going through a lot of changes, and those changes seem to just be accelerating, especially in the past couple of years.
And I'm curious which of the major kind of motivating forces in the data ecosystem and the cloud and technology landscape have been most influential in the ways that you think about the focus and the scope of the product that you're offering?
[00:20:23] Unknown:
Probably the most influential thing has just been kinda watching the rise of these large language models and just all the kind of AI stuff coming to a head, I think. You know, for a long time, you know, it's kinda been the joke among engineers that, you know, all of AI boils down to, like, linear regression a lot of the time. You know? I think we're starting to get to a tipping point where these large language models are doing some pretty cool things, and I can only imagine that's gonna accelerate. Like, I think this stuff is the real deal. We want to be positioned to be able to help power that, because it's all built off of huge datasets, and being able to analyze them and iterate on them and, you know, find interesting clusters and all kinds of things like that.
And I think because of the way that FeatureBase naturally represents things, it's basically doing a categorical mapping of all values into numbers as a consequence of how we have to ingest data to map it into a bit matrix. Every value gets mapped to a number, which is exactly what you want to do when you want to do a neural network or any kind of machine learning on data. Right? You turn strings into numbers before you do anything else. And then we represent all of the relationships in the data in the most compact way that I can imagine. Right? As compressed bitmaps. So you have a single bit representing a relationship.
And then furthermore, when, you know, you shove a bunch of these together, you compress them as much as possible. But you compress them in a way that is computable. You're not just, like, you know, running it through a general-purpose compressor. You're not just, like, zipping it. Right? You're compressing it in a way that you can still compute on. And the underlying technology there is called Roaring bitmaps, if anybody wants to take a look at it. But basically, you look at the data and there's 3 different encoding types based on the density of the data, and every operation is defined on every pair of those encoding types. So you're never decompressing the data to operate on it. You operate on it in place. And that is totally transformative, because typically you think of compression as a trade-off of, like, compute resources for space, but it's not a trade-off anymore. It's just, you make it smaller. And because it's smaller, it's faster to operate on, because you're never decompressing it.
The trade-off now is implementation complexity. The implementation is more complex, but I think that's just the way of the future. Like, software gets more and more complex to make everything better. Right? Like, I mean, that's just how the world works now. It's really, really interesting. But to kind of circle back to the question, I got a little excited there. Being able to support, you know, the future, which is largely gonna be driven by machine learning and AI, you know, being fed huge amounts of data and being fed the right data, being fed clean data, you know, I think being able to analyze that data and get it prepared is just crucial. And so putting the tools out there to do that in a way that is sort of infinitely scalable is basically what it's all about now.
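A rough sketch of the container idea behind Roaring bitmaps that Matt describes, heavily simplified: the real format splits the value space into 2^16-value chunks and also has run-length containers, but the core trick is that sparse chunks live as sorted arrays, dense chunks as bitsets, and operations are defined for every pair of container types so nothing is ever decompressed first.

```python
# Simplified Roaring-style containers: sparse data as sorted arrays of values,
# dense data as a fixed-size bitset. Intersection is defined per pair of
# container types, so the data is operated on "in place", never expanded.
ARRAY_LIMIT = 4096  # Roaring's cutoff between array and bitmap containers

def make_container(values):
    values = sorted(set(values))
    if len(values) <= ARRAY_LIMIT:
        return ("array", values)
    bits = 0
    for v in values:
        bits |= 1 << v
    return ("bitmap", bits)

def intersect(a, b):
    (ka, da), (kb, db) = a, b
    if ka == "array" and kb == "array":
        sb = set(db)
        return ("array", [v for v in da if v in sb])
    if ka == "array":                        # array ∩ bitmap: probe the bitset
        return ("array", [v for v in da if (db >> v) & 1])
    if kb == "array":
        return intersect(b, a)
    return ("bitmap", da & db)               # bitmap ∩ bitmap: bitwise AND

sparse = make_container([3, 17, 4000])
dense = make_container(range(0, 10000, 2))
print(intersect(sparse, dense))              # ('array', [4000])
```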
[00:23:46] Unknown:
Hearing you talk about the application of FeatureBase in the machine learning ecosystem, and the fact that all of the data is already represented numerically, also puts me in mind of another trend in the database market of vector databases. And I'm wondering what you see as the comparable capabilities of FeatureBase as compared to things like Pinecone or Milvus.
[00:24:12] Unknown:
I'll be honest. I haven't delved into vector databases deeply. I think that they are sort of fundamentally transposed from what FeatureBase is doing, and I'll explain what I mean by that. I may be wrong about this, but my sense is that a vector database is taking an entity, a record essentially, and representing that entity as a vector, you know, as a vector of bits or a vector of floats or whatever. FeatureBase is actually taking every individual value from every column that represents that entity. So if your entity is a person, you know, you might have column age, column name, you know, column hobbies, whatever.
FeatureBase takes all the individual values out of those columns and represents each of them as a bitmap. Well, we do it a little bit differently if you have, like, numeric data. But that aside, a vector database is, I believe, still sort of focused on representing the entity as something, whereas we're focused on sort of breaking the entity down, scattering it across, you know, a whole bunch of different places in memory and on disk, because every distinct value of a column is sort of addressable. But to put a record back together, you have to kind of go look at all those different values and reconstruct things. And that's why, earlier on, I said it's a trade-off between transactional and analytical, because for transactional workloads, you're really interested in getting the whole record back. For analytical workloads, you're generally not. You're interested in answering aggregate questions about particular columns or particular values in the dataset.
[00:25:54] Unknown:
That's the fundamental trade-off. Yeah. It's definitely interesting. And digging deeper into that data modeling question, I'm wondering what you see as some of the core concepts that people who are using FeatureBase need to understand in order to be able to make proper use of it, and how they need to think about the specific attributes that they want to decompose into those bitmaps, and how to think about, you know, what are the aspects that we want to be joinable, what are the things that we only care about in isolation, and some of the ways to convert their existing, maybe relational data or, you know, semi structured or unstructured data into the representation that FeatureBase operates best on?
[00:26:39] Unknown:
I'll say again, like, our ultimate goal is you don't have to think about any of this. That's not entirely true today, though. It does behoove you to think about how things actually look, how things are represented under the hood. Now, for the most part, you can ingest your data as normal. Like, I mean, we're now supporting, like, bulk insert SQL statements where, you know, you can just jam, you know, a huge number of records in, and we take care of all the under the hood machinery to transform that into bitmaps and so on and so forth. But if you do think about how it ends up being represented under the hood, especially with regard to set fields, like, that's the main thing: if you have a relational schema where you have something that is tracking, you know, a set of values being associated with a particular record in a particular column, you definitely wanna use a set field for that. You don't wanna try to have a many to many relation and have a separate table.
And there's actually something built on top of that even, something called a time quantum field, that gives you that set functionality and also gives you the ability to associate a time stamp. Of course, the time stamp is coarse grained, it can be down to hourly at most, but it gives you the ability to associate a time stamp with each value that's associated with each record. So you can actually say, like, so and so listened to episode 17 of Tobias' podcast on a particular day or at a particular hour. And then they actually listened to that same episode again the next day, you know, at a different hour. And you can track that all within the same column, within the same table of your schema, from a storage perspective, without sort of going outside of that.
That can be really, really powerful, because you're adding that time component to the set field. You know, the set field sort of gives you that natural power to represent multiple values without any overhead. Now you can have multiple values, and each value has its own coarse grained time stamp associated with it. And you can query across those and say, you know, give me how many people listened to episode 17 of my show in February. You can ask questions like that and then operate on that set of people. So I think those are the things you need to think most about when you're, you know, taking a traditional relational schema or whatever you're doing in another database and you want to move it to FeatureBase.
So those set fields, and then the time quantum fields on top of that, are the most important things to understand. There is some support to do joins, and basically you can have 1 field represent a foreign key. And so we would store that as an integer. And we have a really interesting way of storing integers in bitmaps called bit-sliced indexing, which I won't try to explain right now. But basically, if you have integers of any range, like, you know, you can have 64-bit integers, we basically encode them in 64 different bitmaps. In the typical bitmap indexing, you have to have a bitmap for every unique value. But if you have 2 to the 64 bitmaps, like, you're gonna run out of memory. I'm sorry. So you can't do that. But what you can do is what's called bit-sliced indexing, where we break out each binary digit of the number and store it in a separate bitmap. And it turns out we can reconstruct arbitrary range queries across those in a pretty efficient way.
So that's how we hold, basically, the foreign key references. There's a special query in the storage engine that will take things that are stored in a bit-sliced index, and you can say, like, give me all the unique values in this field as a bitmap. So I'm taking the values that are stored in this column, in this bit-sliced index column, and getting them out as a bitmap. And that bitmap will be applicable to whatever table that foreign key references. So I can go use it as a filter on that table and union it and intersect it with the bitmaps of that table. That's how joins work under the hood, essentially, is through these foreign key references. So you do still have that capability, but what we most often find is that with set fields and time fields, you have to do a lot fewer joins. And that's where the real performance benefit comes from.
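A minimal sketch of the bit-sliced indexing idea Matt outlines (simplified, and not FeatureBase's actual code): one bitmap per binary digit of an integer column, with a range predicate answered by walking the slices from the most significant bit downward, in the spirit of the classic bit-sliced comparison algorithm.

```python
# Bit-sliced index: one bitmap per bit position of an integer column, instead
# of one bitmap per distinct value. Python sets stand in for bitmaps.
ages = {0: 25, 1: 41, 2: 17, 3: 41, 4: 63}
BITS = 6

# slices[i] = bitmap of records whose i-th bit is set
slices = [{rid for rid, v in ages.items() if (v >> i) & 1} for i in range(BITS)]
everyone = set(ages)

def greater_than(c):
    """Records whose value is strictly greater than c, using only the slices."""
    gt, eq = set(), set(everyone)
    for i in reversed(range(BITS)):
        if (c >> i) & 1:
            eq &= slices[i]             # must have this bit set to still be tied
        else:
            gt |= eq & slices[i]        # tied records with this bit set win now
            eq -= slices[i]
    return gt                           # use gt | eq for >=

print(sorted(greater_than(30)))         # records with age > 30 -> [1, 3, 4]
```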
[00:30:56] Unknown:
You mentioned that the overarching goal is that as you're loading data into FeatureBase, you don't have to think about how to actually convert your existing representations into what FeatureBase actually wants. And so I'm curious if you can talk to some of the ingest path and some of the surrounding tooling that you're building to be able to make that a more seamless experience, so that people can just throw FeatureBase at the problem and get the performance and analytical capabilities that they want without having to do a bunch of planning and extra engineering effort.
[00:31:32] Unknown:
Yes. Yeah. Absolutely. So if you're familiar with, you know, Pilosa from back in the day, if any of your listeners ever used that, you basically had to set up a separate process that was running, like, in a separate binary. And it was gonna pull either from CSV files or from a SQL database or from Kafka. And it would do a lot of the heavy lifting of data transformation before sending that data in sort of a bitmapified form up to the FeatureBase cluster. We still have that pattern under the hood because it helps offload a lot of load from the main cluster, which is really nice, to sort of be able to scale your ingest workload and your query workload independently. However, with the SQL engine that we've been building, we are exposing a lot of that same functionality through SQL. And in the cloud product, this will sort of seamlessly just say, like, you know, if you're doing an insert statement, the query planner will know that a lot of this data can sort of be shoved off into your ingest workers that are running all the data transformation components, and your queries can be passed through to your compute nodes that have all the data and are doing query processing.
But even in just a standard, you know, FeatureBase deployment, the ability to just insert some data, you know, at the command line, you know, just to see how it works. I mean, it can't be overstated how nice that is just for getting started and figuring things out, rather than having to mess with a separate tool and figure out, oh, I need to use, like, the CSV version of this tool to ingest these CSV files and I have to get the headers just right and, you know, get all the mappings. Now it just looks like a SQL insert statement you're used to. Right? You set up your schema, you say insert into, here's all the data, and you can even do some inline transformations. It's something we've got coming up, which I think is gonna be really, really nice for people.
You get this kind of, you know, to excuse the cliche, a single pane of glass. Right? Everything goes through SQL, and it sort of just figures out under the hood what it needs to do with that. And because SQL is, you know, a language and we have control of the whole parser and can sort of add whatever we want to it, we can add arbitrary capabilities to ingest right in there. You know, data transformations, mappings, you know, casting things, taking 2 fields and combining them into 1, you know, whatever you really wanna do. Having that all go through that one interface, and then for us to be able to, on the back end, like, optimize it as we see fit, split things out, you know, move things where they need to go, I think it's gonna be really, really nice.
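As a rough illustration of what that single SQL path can look like, here is a hedged sketch of setting up a table with a multi-valued column and inserting into it from Python. The endpoint path, port, type names, and set-literal syntax are assumptions made for the sketch, not FeatureBase's documented dialect; the actual table/type syntax and SQL-over-HTTP details should be checked against the docs.

```python
import requests

# Hypothetical SQL-over-HTTP endpoint; real auth headers and path may differ.
FEATUREBASE_SQL_URL = "http://localhost:10101/sql"

def run_sql(statement: str):
    """Send one SQL statement as the request body and return the JSON response."""
    resp = requests.post(FEATUREBASE_SQL_URL, data=statement,
                         headers={"Content-Type": "text/plain"})
    resp.raise_for_status()
    return resp.json()

# A multi-valued "hobbies" column modeled as a set field rather than a join table.
# The type name and literal syntax below are illustrative assumptions.
run_sql("CREATE TABLE people (_id ID, age INT, hobbies STRINGSET);")
run_sql("INSERT INTO people (_id, age, hobbies) VALUES "
        "(1, 34, ['skiing', 'lego']), (2, 28, ['climbing']);")
```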
[00:34:08] Unknown:
The mention of SQL also brings up the end user experience of working with FeatureBase. And I know that there is a PQL language that has been supported and that the SQL interface is a newer addition. I'm curious if you can talk to some of the kind of user experience, I don't know if design is the right word, but some of the efforts that you've been putting into the user experience aspect of working with FeatureBase to make it more approachable without having to do a bunch of custom training or, you know, self directed learning to try and figure out, how do I actually, you know, take this thing that seems really interesting and make it fit into the box that I understand.
[00:34:48] Unknown:
Yeah. You know, SQL is just a huge part of that, because even if you don't know SQL, like, you're familiar with it. Right? Like, it just comes up everywhere. And so having those common abstractions, we can just say, like, hey, you run this client, tell it where FeatureBase is, and it's gonna give you a command prompt that you can type SQL into, and you can basically manage all your interactions from there. Or, you know, if you're interacting programmatically, you're just making HTTP requests and sending SQL strings.
Like, anybody can do that in any programming language. It just greatly, greatly simplifies things from, you know, what we had in the very early days. PQL, which at that time stood for Pilosa Query Language, was a language that basically mapped directly to what was available in the storage engine. Right? It was like, take this bitmap and intersect it with this bitmap and count the number of set bits in the output. And of course, you know, we've known since the eighties or before that having the language in which you talk about your data be sort of abstract and separate from the data representation is a really powerful and important property.
So it was clear early on that we were going to have to move away from PQL, which is basically the assembly language of the database. Right? It sort of maps directly to what's available at the storage layer. And move up into a declarative, sort of more abstracted language that allows you to just talk about your data agnostic of the underlying representation. So SQL was the obvious choice to do that, because it's already out there. Everybody knows it. It has its warts. No question. But we don't feel the need to try to define an alternative query language, you know, that solves all the problems, you know, along with everything else that we're doing. SQL ultimately will get the job done, and with the added benefit that people already know how to use it and, you know, where to expect things to be kinda weird. Like, oh, like, how do you handle nulls and, you know, that kind of thing. So I think it really is all about making that one consistent interface and using that everywhere, because in the past we've had, like, oh, we expose the gRPC interface, and we actually mimic the Postgres wire protocol so you can use a Postgres client to talk to FeatureBase and send PQL or a subset of SQL over that. So we had all these, like, really cool projects where we explored a lot of different things.
But ultimately, if you wanna build a cloud product and you want it to kinda work everywhere, you need to, like, you know, there's middleware, there's proxies, there's all kinds of things out there on the Internet. And HTTP is kind of universally supported. The authentication mechanisms are well known and well understood. You just use HTTP and you send SQL over it. And that is sort of a universal interface that everybody can kinda grasp on to and understand. And then we can focus on explaining the parts that actually need to be explained, like the set fields and the time fields and, you know, the things that are sort of fundamentally different at the storage level that you wanna be able to take advantage of.
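In the same hedged spirit as the ingest sketch above, the query side is just another SQL string over HTTP. The endpoint, port, and the exact SQL surface (especially grouping on a set column) are assumptions for illustration rather than documented behavior.

```python
import requests

# Hypothetical: POST a SQL query string to the same assumed endpoint as before.
resp = requests.post(
    "http://localhost:10101/sql",
    data="SELECT hobbies, COUNT(*) AS n FROM people "
         "WHERE age > 30 GROUP BY hobbies ORDER BY n DESC LIMIT 10;",
    headers={"Content-Type": "text/plain"},
)
resp.raise_for_status()
print(resp.json())
```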
[00:37:54] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. As far as the kind of security elements of it, given the fact that you are building a cloud product and that you're also investing in a serverless approach, I'm particularly interested in understanding some of the ways that the underlying storage and the ways that you've architected the engine are reflected in the security model, and how you think about multi tenancy and scalability of the engine to be able to support that cloud product approach?
[00:38:53] Unknown:
Yeah. Oh, man. There's so much to unpack here. So all the basics that you'd expect, like, you know, you sign in, you get a token, and everything's encrypted. All the network communications are encrypted and so forth. Where I think it starts to get really interesting is what sort of level of granularity we can expose in terms of permissions and controls. Because there have been some databases that came out, you know, in the last decade or so, especially in the intelligence community, where, you know, one of their big selling points is, like, cell-level access control. Right? Like, you can grant individuals access to particular columns of particular records at a really, really granular level. That's kind of a cool, useful feature, but it also kind of has a high performance cost potentially.
I think with the way things are represented in FeatureBase, we actually have an opportunity to do this in a really efficient way. Now, we haven't done a lot of this yet, but you can think about, okay, so if I wanna give someone access to a subset of the records in a table, that's just a bitmap filter that I'm gonna apply to all their queries on that table. That's obnoxiously fast. Right? There's almost no overhead to do that. So I just have to store that bitmap for that person in probably, like, a hidden field on that table. And, you know, it says, like, this person has access to these records, and that gets basically added to all their queries by the query planner.
If you wanna give someone access to just certain fields, that's pretty easy too. At a high level, you store that metadata about what fields they have access to. And because, you know, like a columnar database, everything is sort of broken out. You know, if they want to do a big, like, backup or a dump of data, it's pretty easy to just read the columns that they're interested in. You don't have to go through and, like, filter out at a granular level what columns you can export for this person. You just go export the columns in question. It's not like you have to scan a table and remove them.
So that's really nice. One thing, though: other databases could do everything I've talked about so far. I think the one thing that FeatureBase could do uniquely is that we store the bitmap data separate from the key data. And by the key data, I mean, what string does each bitmap map to, or what record does each position in the bitmap map to? So you can imagine over here, we've got this big compressed bit matrix. It's just a bunch of, you know, zeros and ones. And then over here, we say, here's what each row in that matrix maps to, you know, skiing and Lego building and, you know, whatever. And over here, we store here's what each column maps to. So let's say you wanted to run some clustering algorithm across your whole dataset. You know, you don't wanna do some really intensive machine learning. You wanna pick out clusters in your data. It's quadratic or maybe even worse than that. So you don't wanna do it locally where you're running. You wanna ship this up to, you know, the cloud where you have elastic compute resources available to you.
You can just, in theory, just send up the bitmap data without sending any of the keys that go along with it. So you're exposing far less information, you know, potentially, if there's some breach in the cloud environment or whatever. You're only exposing this big old matrix of bits that no one has any idea, you know, notionally, about what those actually map to, rather than just exposing the whole dataset. Not to mention, you know, you use a lot less bandwidth, because you're only sending these compressed bitmaps rather than the sort of key value translation stores. I think that's an area where, again, you know, our problem is always focus. Right? We have to decide what to focus on and where to spend our resources, because it turns out building a database is a lot of work. But I think that is an area where we could have some really cool, like, product opportunities.
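A minimal sketch of the record-level permission idea described above, with plain Python sets in place of compressed bitmaps; this is not a description of FeatureBase's actual access-control implementation, just the shape of the trick: the grant is itself a bitmap that gets ANDed into every query like any other filter.

```python
# Toy example: a per-user "visible records" bitmap is intersected into every
# query, so row-level security costs one extra AND per query.
value_bitmaps = {
    ("region", "EMEA"): {1, 2, 3, 8},
    ("region", "AMER"): {4, 5, 6, 7},
    ("tier", "gold"):   {2, 4, 7, 8},
}
grants = {
    "alice": {1, 2, 3, 8},              # Alice may only see EMEA records
    "bob":   {1, 2, 3, 4, 5, 6, 7, 8},  # Bob may see everything
}

def count_where(user, field, value):
    # The permission bitmap is applied exactly like any other filter bitmap.
    return len(value_bitmaps[(field, value)] & grants[user])

print(count_where("alice", "tier", "gold"))   # 2 (records 2 and 8)
print(count_where("bob", "tier", "gold"))     # 4
```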
[00:43:00] Unknown:
Another aspect of what you're building that happens no matter what type of data you're working with is the question of schema evolution, where with relational engines, you can add a column or alter a column or create a new table. But because of the fact that FeatureBase represents every kind of attribute as its own discrete kind of unit of data and the associated bitmaps, I'm curious how that plays into that question of schema evolution and the ways that the data changes as you progress through kind of evolutions of the products and the information that you're trying to represent and work with?
[00:43:37] Unknown:
Adding and removing columns is actually really easy because, I mean, just like a columnar database, all the columns are stored separately. So it's really no problem to add or remove a column. Now, if you want to modify, you know, a column, like you wanna represent your integer differently or something, a lot of that stuff is notionally possible and will have no more overhead than it does in your typical relational database. But we haven't implemented a lot of it. So there's a few, like, alter column type things that are supported. We recently added the ability to add a TTL to a time column so you can sort of age off old data automatically.
You can turn that on and off or tweak it. But for the most part, the way that our customers are evolving their schemas is through adding and removing whole columns right now, which has actually worked out okay. It hasn't been sort of a major sticking point, but I'm sure that we'll get more demand for being able to alter columns and, you know, sort of evolve the schema in arbitrary ways in the future. I don't think there's any, like, fundamental reason why, you know, a different database would be better or worse at a lot of these things. A lot of these are just, they're really, like, heavy data transformation operations that you just have to sort of support, and you try to support them in a way where the database stays live, you know, while it's happening. So you're not, like, overwriting the data in place. You know, you make a copy, you make some changes, and then you flip over when it's done. I think all of these things are doable. It's just a matter of implementation time and, you know, deciding that it's worth it to put our resources there and actually expose those capabilities.
[00:45:26] Unknown:
As far as your experience of working on FeatureBase and evolving the project and the product and keeping an eye on the broader data ecosystem and how your database engine is being used, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:45:42] Unknown:
You know, I think one that came up a little while ago was some folks doing kind of a security use case, where they were taking lots of different application binaries and breaking them up in various ways and hashing parts of them and, you know, like, you know, breaking out different parts of the binary and then sort of storing all these different hashes and things as features in FeatureBase. Features just sort of being distinct values of a column, but you can see where some of the naming comes from there. And then using that to sort of very quickly and at scale detect whether some new binary or some new artifact in somebody's system might have malware associated with it and sort of assign it a score. If that explanation sounds kind of, like, a little fuzzy and hand wavy, it's because I don't actually fully understand what they're doing. And that's actually one of the most exciting parts about it to me, is that you're getting value out of the software we developed in a way that I don't fully understand. Like, that's really cool. I think, you know, that speaks to, you know, this thing having broader applicability than sort of what it was born from and where it came from, which is really exciting to me.
[00:46:58] Unknown:
In your work of helping to build and evolve the engine and starting to build out this cloud product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I knew that
[00:47:10] Unknown:
building a database was a huge amount of work. I mean, you see it quoted, like, you know, databases take a hundred engineer-years to mature, and you see that kind of thing thrown around. What I didn't realize is that I think a lot of those engineer-years are spent working on CI and, you know, testing, and getting your tests to run reliably, and testing at scale, and making sure that you clean up your scale tests after they're done so you're not spending a fortune in the cloud, and getting your tests to run reliably and not, you know, be sort of flaky and fail. I feel like there's an old quote from someone where, you know, it's like one of the first bugs that was found, and someone realized that they were going to spend a significant fraction of the rest of their life searching for bugs in their own programs. You know, this is back in, like, the forties or something. And I was like, I think I just realized recently I'm going to spend a significant fraction of my life, like, dealing with CI. Right? Because you have to have CI, you have to have really, really good test coverage.
Anything that's not tested in CI, you can assume it doesn't work, basically. That's kind of my motto. Like, if it's not in CI, it doesn't exist, because things will break arbitrarily. You know, when you've got a dozen developers or more working on a product, changing things, adding features, you can't watch everything all the time. If you want to make sure that something is going to work and nobody's going to break it, you better have a test for it, and it better run in CI. And then you better go and, like, look at CI from time to time and make sure that it's not failing silently, because that'll happen too. So maybe not the most exciting answer, but, yeah, it's just really, really important and really, really hard and time consuming to get right. And I think that's where a lot of
[00:48:57] Unknown:
the efforts in building a system like this need to be focused. Absolutely. Yeah. Particularly for a storage engine where people are trusting it to be one of the most critical elements of their infrastructure. Because if your web server goes down, well, you just spin up a new one. But if your database goes down, if you don't have backups, then you're out of luck, and you just threw away, you know, however much time your company spent collecting all that information.
[00:49:23] Unknown:
You gotta have backups. You gotta test your backups. You gotta test your restore routines. And databases, I think, are particularly in need of testing, and they're particularly difficult to test, because you need to have giant, like, fixtures of data to be able to shove into them, and then you need to, like, know what the right answers are for queries, and you have to have those stored. And, yeah, there's a lot to do. Absolutely.
[00:49:49] Unknown:
And for people who are interested in trying to speed up their analytics and be able to query across larger datasets, what are the cases where FeatureBase is the wrong choice?
[00:50:02] Unknown:
I like the way you asked that question, because it sort of precludes someone using it for a transactional workload, which you definitely should not do. In demos, I used to run a select star limit 1 type query, and then a count with a bunch of different filters. The select star takes something like 75 milliseconds to run, while the count across the whole dataset takes around 3 milliseconds. If your primary use case is that you need to reconstruct whole records and get individual records back, you're not gonna have a good time with FeatureBase. If your use case does a lot of filtering and aggregates, you need four things. You need low latency queries, either because you're an impatient person or because you're powering a web page that you want to be very responsive.
So you want low latency queries. You want fresh data, so you're not querying data that's a day old from an overnight batch routine; you want it live ingested. That's something we've spent a ton of engineering resources on in FeatureBase. We micro batch incoming data and apply it live, so within about a second of data being generated, it's available for query in the database. So you want low latency queries and you want fresh data. Potentially, you also need high concurrency queries, because you've got lots of users hitting the system.
Typical analytics use cases are business intelligence type things, where you might have one analyst poking around at a SQL prompt. This is more like you're powering an application with it and exposing it to a broader audience, so you have more query concurrency coming into it. So on the query side: low latency and high concurrency. On the ingest side: very fresh data, which is latency on the ingest side, your data being available as it comes in, and high throughput, so maybe you're generating 500,000 records per second and you want those to be freshly available for analysis.
If you need all of those things, I think you definitely need FeatureBase. If you need two or three out of the four, you might wanna look at FeatureBase. If you don't need those things, or if you're doing transactional workloads, or if you don't have billions and billions of records right today, FeatureBase is probably more trouble than it's worth in terms of the operational overhead and the maturity of the tool. Come back in a year or two, and I think it's gonna be just as easy to use as most other databases, and then maybe it just makes sense anyway. But, yeah, I think it's about those four things: the latency and throughput on the ingest side, and the latency and concurrency on the query side.
If you're doing analytics and you need to hit all four of those areas, I think you should take a very close look at FeatureBase.
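To make that contrast concrete, here is an illustrative sketch of the two query shapes Jaffee describes from his demos. The table and column names are hypothetical and the syntax is generic SQL, not a verbatim FeatureBase example:

```sql
-- Point lookup / whole-record retrieval: the shape FeatureBase is not optimized for,
-- since reconstructing a full row means touching the bitmaps behind every column.
SELECT * FROM events LIMIT 1;

-- Filtered aggregate: the shape FeatureBase is built for, since the predicates and
-- the count reduce to fast bitmap intersections across the whole dataset.
SELECT COUNT(*)
FROM events
WHERE country = 'US'
  AND device_type = 'mobile'
  AND plan = 'enterprise';
```

The point is the shape of the workload rather than the exact syntax: heavy filtering and aggregation over billions of rows plays to the bitmap representation, while record-at-a-time retrieval does not.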
[00:53:08] Unknown:
As you continue to build out the open source project and the cloud offering, what are some of the things you have planned for the near to medium term, either for the core engine or just for the ecosystem around it to improve the user experience?
[00:53:23] Unknown:
In terms of the user experience, it's basically all about SQL: getting more SQL support, and getting things optimized so that functionality isn't just technically available in SQL but actually works at a reasonable speed. That's really the main thing on the user experience side. The other main thing on the roadmap is the serverless work, and that will definitely affect the user experience, because you can stop thinking about your FeatureBase deployment as this cluster of nodes and just start thinking of it as: I've got some databases in here, and if I add more data, I don't have to think about anything. On the back end, we're running it in Kubernetes or ECS or whatever, adding new containers and rebalancing data as needed.
And our hope is that we don't expose too much of that. I mean, we'll expose whatever we need to. So if somebody wants to say, hey, I want some dedicated resources and I want to scale them up because I know I've got a big event happening, I think we probably will end up exposing some amount of that. But we'd love for it to be a pretty seamless experience where you think of it like S3. You don't think about scaling up your S3 deployment; you just shove a bunch of stuff in there, don't worry about it, and it'll be there when you need to get it out.
[00:54:44] Unknown:
Are there any other aspects of the work that you're doing on FeatureBase and the database engine and the products you're building around it that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:55] Unknown:
We could talk about set fields and time fields, which I think are the most important from a usage perspective. But I think the thing I'm most excited about right now is definitely the serverless work, and that's gonna be launching to the public probably early next year, where you can actually get onto featurebase.com and spin up a serverless deployment. I think that's just gonna be really cool and really transformative, where you can get on there, bulk insert a bunch of data, and not even think about the resources behind it. I can't tell you how many conversations I've had about how do I size my cluster, because when you can't easily scale it up and down, that really matters: what types of nodes do I pick, how much memory versus CPU? For all of that to just go away, and for it to dynamically rebalance in the background, is gonna be so huge. And there are a lot of interesting architecture and computer science problems under the hood that we get to work on. Other than that, I think we covered a lot of really good ground here. Absolutely.
[00:55:55] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or even contribute to the engine given that it's open source, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:14] Unknown:
I wanna give a bit of a meta answer to this question. I think there are a lot of gaps and a lot of things that need to improve, but the thing that will improve all of them the most is better programming languages. Everything kind of stems from that. The iteration speed for developers to improve things depends a lot on the programming language that you choose. We use Go. I love Go. It has plenty of warts, and I'm not convinced that any of the modern languages in use today have solved all the problems, which is a silly thing to say; nobody's going to solve all the problems for a while. But I think there are still a ton of benefits to be had from improving programming languages and development environments, specifically with a focus on iteration speed. The more you can decrease your cycle time for testing changes and tweaking things and playing with things, the better off the world will be in terms of the speed at which we can create new things and improve all other aspects. So that's just kinda where my head's been at lately.
Sorry to take that question a little bit off the rails, but I think programming languages are vitally important, and they really only get worked on if they're sponsored by a huge company, because it's hard to see how you would make money building a programming language as a business. That's something to look at and think about for the future: how those very fundamental tools are going to evolve, because I think they can have a huge impact. Absolutely.
[00:57:49] Unknown:
Well, for one, there are no rails to this question; that's the point of it. But I definitely appreciate that perspective, because as you said, the programming language, as with all language, really shapes the ways that you're able to think about given problems. And for the problems that we're starting to address, we need to come up with new ways to think about them, particularly as we move into new areas of architecture and infrastructure, and into problems that haven't been solved yet, or haven't been solved well. So I definitely appreciate that perspective. Thank you again for taking the time today to join me and share the work that you and your team are doing at FeatureBase. It's definitely a very interesting project, and I'm excited to see some of the directions that you take it. I hope you enjoy the rest of your day. Hey, you too. Thank you so much for having me, Tobias. This has been a lot of fun.
[00:58:44] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Chapters
- Introduction and Sponsor Message
- Interview with Matt Jaffee: Introduction and Background
- Overview of FeatureBase and Its Use Cases
- Evolution of FeatureBase and Cloud Product
- Impact of AI and Machine Learning on Data Management
- Comparison with Vector Databases
- Data Modeling and Schema Evolution in FeatureBase
- Security and Multi-Tenancy in FeatureBase
- Interesting Use Cases and Lessons Learned
- When Not to Use FeatureBase
- Future Plans for FeatureBase
- Closing Remarks and Final Thoughts