Summary
Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what constitutes a NoSQL database?
- How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago?
- What are the factors that convince teams to use a NoSQL vs. SQL database?
- NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus?
- How have the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines?
- When designing and building a database, what are the initial set of questions that need to be answered?
- How many "core capabilities" can you reasonably design around before they conflict with each other?
- How have you approached the evolution of RavenDB as you add new capabilities and mature the project?
- What are some of the early decisions that had to be unwound to enable new capabilities?
- If you were to start from scratch today, what database would you build?
- What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB?
- When is a NoSQL database/RavenDB the wrong choice?
- What do you have planned for the future of RavenDB?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- RavenDB
- RSS
- Object Relational Mapper (ORM)
- Relational Database
- NoSQL
- CouchDB
- Navigational Database
- MongoDB
- Redis
- Neo4J
- Cassandra
- Column-Family
- SQLite
- LevelDB
- Firebird DB
- fsync
- Esent DB?
- KNN == K-Nearest Neighbors
- RocksDB
- C# Language
- ASP.NET
- QUIC
- Dynamo Paper
- Database Internals book (affiliate link)
- Designing Data Intensive Applications book (affiliate link)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the promise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle, with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey, and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine. So, Oren, can you start by introducing yourself?
[00:01:49] Unknown:
Yes. So I'm Oren. I'm the CEO of RavenDB. I've been working on databases for 20 plus years, and I've been working on RavenDB specifically since 2007, so over 15 years at this point. RavenDB is a document database. It's written in .NET and runs on everything from the Raspberry Pi to Linux, Windows, cloud, and edge deployments, and that's about it. That's the elevator speech.
[00:02:22] Unknown:
And do you remember how you first got started working in the data space?
[00:02:28] Unknown:
Completely accidentally. I was looking into building some business application, something that I wanted to do for myself. I think it was a feed reader, back in the RSS days, and I needed to persist some data and started using an object relational mapper. And then I got really interested in how it works and what's going on, and eventually I ended up being a contributor to the NHibernate project, and that has been my life for about 5 or 10 years, where I got really deep into that. How do databases work, how do they operate, how do you operate over data, how is it reflected in your application, etcetera.
It was also a time that was incredibly frustrating to me, because I understood how the database worked, but I kept getting queries from people and requests for assistance that did some pretty horrible stuff. I love saying that at one point I went to a customer and consulted on a performance problem that they had, and it turned out that they were running over 17,000 queries to load a single page. And you look at it and say, I don't even know how to start talking about it. There is a saying that applies: this is not even wrong. This is just something completely different.
And so I did that for a while, and at some point I got really frustrated with seeing the same sort of errors over and over again. So I decided that I was going to do something about it, and I wrote an automated tool that would basically look at the interactions between your application and the database and analyze what's going on. Now, a profiler for the database is absolutely not something new. But my point was, I don't want to just analyze a specific query, or tell you that you need indexes. I want to analyze the entire interaction between your application and the database. Seeing how, oh, you load this data multiple times, you're talking to the database too much, those sorts of things.
Relatively simple things all told, but the amazing thing is the amount of problems that it found, and the level of impact it had was huge. And that led me even deeper into the holes of people who got burned working on complicated data projects. At that time, we're talking about relational databases being almost the only game in town. NoSQL databases were just getting started. And I remember looking at CouchDB and going, wow, I really loved what it did. It clicked. It made sense. And then I looked into what it actually took to run it, and it was somewhere slightly beyond black magic and blood sacrifices.
You could do it, you could get it working, and if you got the planets to align just right, it would even keep working. But then, you know, you're in a position where you're juggling knives, and it's absolutely amazing until you slip, and then you have to pick up your fingers from the floor using your own teeth. I remember being really upset about that. It was like, here is something that's fun to work with, and I cannot make use of it. So I had this idea: here is this database that would be perfect. And it's fun because I've now been working on databases for a long time from the perspective of the person who builds database engines.
And what ended up happening is that I sat down and thought about how a database should look from the other side, because back then I was primarily an application developer. So what are the sort of features that I want? And some of them are so stupid. They are such stupid things, and they drive me crazy. One of them, for example: let's say that you want to show a paged list. What do you need in order to actually render a paged list to the user? You want, give me the first page of the items, and give me the total count so I can show how many items you have there. It turns out that this is incredibly expensive.
Why is that? Because you have to actually execute the query to gather all of the results, sort them, get the right values, and then you have to do it all over again just to get a count. So SELECT TOP 10 FROM, and then SELECT COUNT FROM. It sounds silly, but let me walk you through what actually happens to you. First of all, let's say that you have a non trivial query. You have to execute this non trivial query twice. The other thing is, let's say that you are on the cloud, or the database is over the network or something like that, and you have network latency talking to the database, even if it's just 2 to 3 milliseconds. Now you're spending 6 milliseconds just on network traffic.
And when I designed RavenDB, one of the things that I did was, oh, here is this small feature. Let's ensure that whenever you ask for the first page of the results, or for any page of results, I will also give you the total count right there. There is no extra cost that you have to pay. So okay, I saved the query, which is nice, and I also saved a network round trip. And this is, like, this tiny, small detail: oh, I don't need to worry about this anymore.
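As a rough sketch of what that looks like from the client side, here is a query using the RavenDB C# client that fetches one page and reads the total count from the query statistics in the same round trip. The store URL, database name, and entity class are placeholders, and the exact client method names should be treated as approximate rather than authoritative.

```csharp
using System;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;
using Raven.Client.Documents.Session;

public class Product { public string Id { get; set; } public string Name { get; set; } }

public static class PagingExample
{
    public static void Run()
    {
        // Placeholder connection details.
        using var store = new DocumentStore
        {
            Urls = new[] { "http://localhost:8080" },
            Database = "Demo"
        };
        store.Initialize();

        using IDocumentSession session = store.OpenSession();

        // One request: the page of results and the total count come back together.
        var page = session.Query<Product>()
            .Statistics(out QueryStatistics stats)
            .Skip(0)
            .Take(10)
            .ToList();

        Console.WriteLine($"Showing {page.Count} of {stats.TotalResults} products");
    }
}
```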
[00:08:50] Unknown:
Wow, yeah. As you said, there are a lot of things that seem silly as the user, why is this so hard, and then you start digging into the implementation details and you say, oh, it makes sense why it's so hard in this case, but it shouldn't have to be that way. You mentioned the juxtaposition of relational database engines and NoSQL engines. And for purposes of this conversation, we want to talk through what it means to actually design and build a NoSQL database. But before we can get into that, I'm wondering if you can first give your definition of what constitutes a NoSQL database.
[00:09:27] Unknown:
I would want to correct something: not NoSQL, but non relational. And the reason for that is that SQL as a query language has absolutely won today. I believe that the only popular database right now that doesn't have a SQL variant is MongoDB, and I don't know why they still do that. It's not complicated, and it's so much easier. But if we look at relational versus non relational databases and go back, you know, 40 or 50 years, there used to be database wars. There used to be this scenario where you had to select which database model you wanted, and there were navigational databases and all sorts of other models like that. And relational databases just won. And they won so conclusively that in many respects, a database today means a relational database, and if you are a non relational database, you have to explicitly say that.
And that lasted for close to 30 years, which is an amazing time frame, especially when we start talking about the speed and velocity of software development. And it turns out that two major things happened to invalidate or shake the foundation of the database world. One of them was the Internet and working in distributed environments, and the other was the amount of data and the scale that we worked at. And it's funny, because to a large extent a lot of that was actually driven by hardware problems. If I look at the early Google or any of the big web properties of the early 2000s, one of the primary issues they had to deal with in terms of data sizes was the physical limitation of the hardware. You're using HDDs, hard disk drives. And if you had a really good one, you had 10,000 RPM, which means that in the best case scenario you have low hundreds of reads or writes to the disk per second. Which means, okay, if I want to support more than, you know, 500 reads per second from the disk, I literally cannot do that. I have to get two disks.
And that led to basically splitting the data across multiple machines, commonly sharding or something like that. Which then led to a really interesting problem: now you're running in a distributed environment, with everything that entails. So let's take the simplest example I can think of: let's say that I have users that are spread over multiple nodes in a sharded database, and I want to ensure that they have a unique username. Well, how do I do that if I have to deal with some part of my system being unavailable or something like that? How do we handle foreign keys in a distributed environment? Those are all really hard problems.
And another aspect that we have to deal with is the amount of complexity that you look at in terms of the data. If we're talking about something that used to be very simple, let's talk about a shopping cart. Well, a shopping cart used to be: here is the name of the person, here is the product that they purchased, and maybe the amount. Nothing really that interesting. But it still means that if you want to render the shopping cart for a user, you have to touch 3 or 4 tables. Okay, that's reasonable, except that now you don't just have a shopping cart with just the items. There are offers and sales and recommendations, here's what other people bought, and did you mean to buy this or that.
For each one of the products, we also have to run analysis on allergens, and whether this is organic or non organic, vegan or not vegan, and a whole bunch of other stuff like that. So the amount of complexity that I have to deal with is really high. Now let's talk about how I, as a user, want to look at my shopping cart and see, you know, does any one of my items contain gluten, yes or no? Well, that means that if I'm working on a relational database, that's 2 or 3 additional queries per item in the shopping cart, just because I'm using a relational model and the data is spread across multiple tables. That's leaving aside things about scale and size and everything like that. So the cost and complexity of building systems using relational databases increases in a very rapid fashion, and that is something that is very easy to ignore until you go from your development machine, where you have a small amount of data, a single user, and not a lot of traffic, and you push it to production, and suddenly you look at your database and, oh, I'm doing 100,000 queries per second and I don't even understand why. But, you know, my Amazon RDS bill is through the roof because of that.
Non relational databases cover everything from key value stores, like Redis, or document databases, like RavenDB, to graph databases like Neo4j. They take very different models for how you operate them. You also have databases such as Cassandra and other column family databases, where the amount of data that you have is stupendous, in the hundreds of terabytes or hundreds of petabytes, and then you realize that, okay, you're now back to first principles. My data is spread over 200 or 300 servers, and I have to be very careful about how I orchestrate and model things, because I cannot just do, you know, a random query over a petabyte of information. Not if I expect to be able to return results to the user within, you know, 100 or 200 milliseconds. Today, and I'm talking about today versus what we had, you know, 15 or 20 years ago, I think that the primary difference between relational and non relational databases is in the way that you model the data, whether this is in tables or in many different representations, the sort of operations that you can run over this data, and the distribution and management model.
This is really interesting, because if I'm looking into document databases, this is near and dear to my heart, obviously. I've been working on this for a long while. You can look at it and say, at the core, a document database is just, okay, here is a bunch of JSON files, XML files, or something like that. Well, I remember using SQL Server 2005, it had XML columns. I made extensive usage of that back in the day, and right now I don't believe there is any relational database that doesn't have some support for JSON values. So what's the difference between those?
Well, the difference is between, oh, I have some JSON text or JSON values that I can operate on to some level, but still the entire system is based around rows and columns. For example, if I want to store a user document as a JSON value inside of Postgres, then trying to ensure that I have a unique username inside the JSON value is not something that you want to tackle. I don't know if it's even possible. Last I checked it was not, but I'm not an expert in that. But even if it is, it's really, really complicated.
Defining things like indexes or aggregations over those values is also pretty complicated, and there's another aspect that is almost accidental here. A relational database almost by definition means that you are on one server. If you want, oh, I want a cluster of servers, then, you know, in almost all cases it's going to be, oh, I have a primary and secondaries, and the secondaries are only used as read replicas. And if I'm really, really smart about it, I'm going to do automatic failover. But relational databases are very strongly focused on there being a single source of truth, and it moves between those nodes each and every time. If you're trying to deal with network partitions, or any sort of issues that you run into in production, it's very easy to get yourself into a really bad situation. I have a really good example here, which is GitHub in 2018.
They were using, I believe, MySQL at the time, and they were running multiple clusters in West Coast and East Coast data centers, and they had a primary node that accepted all of the writes. And at some point they had a 43 second network failure, which triggered an automatic leader election and moved the primary to a different node. I think it actually went to the West Coast data center instead of the East Coast data center, something like that. And because it was a network partition, some other nodes in the system did not get this notice. So what actually happened was that they had nodes that were writing to different primaries at the same time.
So we are talking about 43 seconds of network partition, and that led to GitHub being down for over 24 hours. The postmortem that they published afterward reads like a thriller, basically, in the sense of, oh, we have to do this, we have to do this. Eventually they had to reconcile transactions by hand, as insane as that sounds. So when we talk about relational versus non relational databases, I think that the way that you work with them and the manner in which you operate them are very different. If I'm looking at RavenDB specifically, a lot of the operations are explicitly designed around the idea that the database is a grown up. You don't need a babysitter.
You don't need someone to continuously monitor it and wipe its nose and things like that. A great example that I have here is that query planning in databases is complicated. Complicated to the level of multiple PhDs in that area over decades of work, and it is still not nearly a solved problem. And as a user of a database, you're expected to be able to make good use of that and define the proper indexes and understand all of the tradeoffs and everything like that. And if you change your query to the most minuscule degree, well, it may use a different query plan, and suddenly your entire system is down.
Whereas with Raven, for example, one of the assumptions that we make is that I don't have a babysitter. I don't have a fully trained DBA on call to watch over everything. So you run a query, and there are two responsibilities for the query optimizer inside of RavenDB. One of them is to actually generate the optimal query plan, but this is what every database tries to do. The other one is to detect when I don't have a good query plan, and then I want to be able to generate the appropriate set of indexes and operations, to change the structure of the database itself in order to handle that. And that leads me to something that is funny. When you talk about relational databases, people tend to think about the actual relations, tables and columns, but there is a much bigger problem in the way that you even think about the operations. Let's talk about something as simple as a shopping cart, and let's think about how it works. I have a shopping cart. I want to add an item to the shopping cart.
So I look at the product page, I click add, then it goes to some API endpoint. It loads my shopping cart, creates a new line item, or increments the line item if it's already there. And, okay, that's great. But then I move on to actually going ahead and purchasing that order, checking out, etcetera. And if I'm looking at the actual interaction between my application and the database, it turns out the situation is a lot more complex, because I have this API endpoint that got this request to add a new item to the shopping cart, and now my API starts to have a chat with the database.
Can you give me this? What about this? I want you to modify this, and eventually, go ahead and commit the transaction. And remember what I said earlier about network latency being a meaningful cost. Relational databases were designed at a time when things were a lot slower. The number of users a database had to serve was minuscule, the amount of data that we worked with was smaller, and most of the time we were working in the same location, so the cost of going over the network tended to be far lower.
So you had this idea of, okay, I'm going to begin a transaction, ask the database a question, wait for the user to actually look at that, and then finally decide to commit everything. But this is not how we do things these days. We are doing that in discrete calls using a request response API. There's no underlying transaction that encompasses my process of purchasing something from a store or something like that. It's a set of discrete operations. Now, when it comes to translating that operation, the idea of having a chat with the database and doing this back and forth, versus the model that RavenDB uses, which is the batch transaction model, where the client side aggregates all of the changes that you want to make and then sends all of them to the database in one shot, means that you get a vastly reduced amount of network traffic.
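A minimal sketch of that batch model with the RavenDB C# client: the session tracks everything you load and change, and nothing is written over the wire until SaveChanges sends the whole set as one transactional batch. The class shapes and identifiers are placeholders, and the exact client signatures should be treated as approximate.

```csharp
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;

public class ShoppingCart
{
    public string Id { get; set; }
    public List<CartLine> Lines { get; set; } = new();
}

public class CartLine { public string ProductId { get; set; } public int Quantity { get; set; } }

public static class AddToCart
{
    public static void Run(IDocumentStore store, string cartId, string productId)
    {
        using var session = store.OpenSession();

        // The read happens here; the writes below are only tracked in memory.
        var cart = session.Load<ShoppingCart>(cartId) ?? new ShoppingCart { Id = cartId };

        var line = cart.Lines.FirstOrDefault(l => l.ProductId == productId);
        if (line == null)
            cart.Lines.Add(new CartLine { ProductId = productId, Quantity = 1 });
        else
            line.Quantity++;

        session.Store(cart);

        // One round trip: all tracked changes are sent as a single transactional batch.
        session.SaveChanges();
    }
}
```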
I think that, looking back, one of the biggest and most important design decisions that we made was deciding what sort of guarantees I want to provide to my users. And by being able to reduce the set of guarantees that I need to provide, I was able to dramatically simplify the overall design and implementation of the database, while being able to offer a lot more features to the end user.
[00:25:45] Unknown:
That's definitely a great overview of all of the different considerations in non relational engines. And I think another interesting aspect of this space is, to your point, that most of these engines do have a SQL interface now, which wasn't the case at the outset when non relational engines first became a popular choice. They've been around for decades, but they really started to take off as a class of database engines around the 2009, 2010 time frame, I think. You know, sure, people debate the timing a little bit, but roughly in that area. And I'm curious how you have seen the user requirements around non relational engines change over that time, as people started to realize that maybe they can have their cake and eat it too, and not just say, oh, well, if I'm using a non relational engine, then that means I have to give up on the idea of transactions, or I have to give up on the idea of being able to query in SQL, etcetera. Okay. No, no, no, no, no. Hey, this is
[00:26:52] Unknown:
Okay, I'm sorry, you just hit the big red blinking button. Transactions are important. Like, I don't understand how you can build a database without transactions. Again, going back to my example of a shopping cart, I want to add an item to the shopping cart and maybe modify some details, oh, this is my favorite product ever, but I want both of those things to go to the database in a single unit. If I don't do that, then somehow I need to maintain consistency in my application. I have had the pleasure, the dubious pleasure, of implementing multiple transaction systems, and needing to build something like that in application code is absolutely insane.
Most of the non relational databases, especially in the 2010 time frame, did not have transactions. They are complicated to get done, and the thing that NoSQL databases most wanted to do was, let me show you how fast I can go. I think that over time, two important things happened. One of them is how the hardware changed. If in the early 2000s we had hard disks, then we moved to SSDs and now to NVMe, which means that I'm able to provide, you know, 200,000 writes per second. Yeah, I can handle that, I have the hardware to support that. You want to handle big write rates? Yeah, no problem. Give me enough network bandwidth and we will deal with that.
So that meant that a lot of the barriers around, oh, how can I implement write transactions properly in a performant manner, became much simpler. I remember at one point I was running performance benchmarks, and I was trying to compare SQLite, Firebird, and I think LevelDB or something like that, talking early 2013 or 2014. And I think that Firebird did not use fsync. And to people who know about databases, that means that it cannot actually be durable. And as we were looking at it, that's not acceptable. And then I looked into how SQLite uses fsync and how LevelDB uses fsync.
And I was looking at another database called Esent, which is an embedded database that comes with Windows. It's now open source, and you can look at the code. It's really interesting. And I remember SQLite was at, you know, 200 or 300 transactions per second, LevelDB at, you know, 150 to 200 per second, and Esent was at 20,000 per second. I could not even understand how this was possible. At one point I'm looking and I'm tracing the system calls, and I realize that, oh, it doesn't use fsync, it uses direct I/O. And I remember implementing similar features afterward when we built our own storage engine, implementing something called transaction merging. You have multiple writes to the database, and the transactions are aggregated and committed in one shot, and the performance jumped by about 15 times.
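As an illustration of the transaction merging idea (a sketch, not RavenDB's actual implementation), here is a small C# example: concurrent writers enqueue their operations, a single background loop drains the queue, applies everything, flushes once, and then completes all of the waiting writers together, so many logical transactions share one expensive durability step.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// One pending write plus a way to signal its caller when it is durable.
record PendingWrite(byte[] Payload, TaskCompletionSource Done);

class TransactionMerger
{
    private readonly ConcurrentQueue<PendingWrite> _queue = new();
    private readonly SemaphoreSlim _signal = new(0);

    // Callers enqueue a write and await durability.
    public Task WriteAsync(byte[] payload)
    {
        var pending = new PendingWrite(payload,
            new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));
        _queue.Enqueue(pending);
        _signal.Release();
        return pending.Done.Task;
    }

    // Single merger loop: drain whatever is queued, apply it, flush once, acknowledge all.
    public async Task RunAsync(Stream journal, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            await _signal.WaitAsync(token);

            var batch = new List<PendingWrite>();
            while (_queue.TryDequeue(out var p)) batch.Add(p);
            if (batch.Count == 0) continue;

            foreach (var p in batch)
                await journal.WriteAsync(p.Payload, token); // apply every pending write

            await journal.FlushAsync(token);                // one durability cost for the whole batch

            foreach (var p in batch)
                p.Done.SetResult();                         // every waiting transaction is now durable
        }
    }
}
```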
So the requirement from the users is that you cannot give up transactions anymore. It's not just, I want to be really fast. I want all of those things, but I still absolutely expect you to give me features like transactions and queries. Those are not optional. The data access should be fast, but I also want to be able to operate on the data in all sorts of interesting ways. Another aspect that is really interesting is, again, going back to the hardware: I can go and buy an 8 terabyte disk for a few hundred dollars.
Enterprise grade, with, you know, RAID and everything, wouldn't break the bank. And if I want to, I can go even higher, and that means that I'm no longer constrained by what I can do on a single machine, no longer concerned about what is possible. My CPU is a lot faster. So I have a lot more available capacity, and that, I think very justifiably, translates into two things. First, we are much faster, and second, users aren't willing to accept walking uphill both ways in order to achieve their performance goals.
[00:32:09] Unknown:
For teams who are doing the calculus of choosing a relational versus a non relational engine, I'm curious what you see as the major factors that push them in one direction or the other, and in particular, how the underlying data model might shift that decision making, where maybe they're used to tabular representations, but they need graph for being able to do relational traversals, or they are interested in being able to use the document store, but then they're concerned about being able to join across different collections. Just wondering how that influences their choices.
[00:32:50] Unknown:
Yeah. So for almost all OLTP applications, I would suggest starting from the user interface. Start from how the user thinks about your system. And even if we are rendering a lot of tables to the user, at the end of the day, most of the things that we work on today are complex object graphs. When I'm talking about complex object graphs, I'm not talking about graphs in the sense of social networks or find my friend or something like that. I'm talking about what sort of model I'm working on. Think about what a credit card statement looks like, or a vacation request for a user, or an insurance policy, a hospital visit, those sorts of things. All of them are really complicated.
They are not something that you can easily throw into a table and then render back again easily. In most cases, and again, I'm absolutely biased here just to be clear, I love modeling on pen and paper. Here is what this looks like. And I'm looking at this document, and then I need to decide: is this something that I can just take one to one and throw into a document database, or does it make sense to try to throw that into a tabular, relational database? And for reporting purposes, I think that relational databases, or even dedicated column store databases, are still very much the right tool for the job. But the difference there is that those queries are more complicated and I'm working with more data.
For transactional operations, I tend to want to touch this specific thing, that thing, etcetera. And that, I think, fits a lot more closely with the document model. Something that we haven't touched on, but is important, is that we tend to not have just a single data store across our system. Let's say we're talking about an insurance application. So I may have time series data for the heart rate of my users, and then I have the policy document with all of, you know, the pre existing conditions: you're going bald, no hair implants for you, whatever.
And, again, think about the insurance policy. That is a really complicated structure. Try to throw that into a relational database, and good luck trying to get it out of there. But, you know, I also want to be able to search for, oh, have we seen this type of claim before? This guy has been totaling a car every 3 weeks for the past year, maybe we should stop insuring him, stuff like that. So we're looking at multiple data stores in many cases. You mentioned graph databases, and graph databases in many cases are used specifically to solve graph problems. They wouldn't be the system of record.
It's the same if I'm looking at Elastic: again, not a system of record, not where you put the data, but you tend to throw a lot of data into it, and then maybe you use something like Kibana to gather some information out, or maybe try to use unstructured search and just try to discover things on the fly, etcetera. For the system of record, however, there are different considerations, whether that is transactions or complexity, and the ability to query the data, I think, is quite important. On the other aspect, when you mentioned joins across different tables, I want to challenge that statement and talk about when you actually need to join between tables.
Usually it's when you're operating over different aspects of the same aggregate root. So if we take the shopping cart and line items, if I want to render the shopping cart, I often need to join to the line items and the product tables as well. But if I'm storing the shopping cart as a document, then I don't need that. I can just say, give me this document, and I have all of that information directly in there. Now I'm going to talk about a specific feature of RavenDB here, and this is something called include. You can say, hey, I want you to look at this document, look at those parts which reference additional documents, and give me this document plus those related documents as well.
In MongoDB, there is something called a lookup, which serves a similar purpose to a join. And the idea here is that you can traverse between documents even in a non relational database, but it's not a join. The reason that include is not a join is that it's not a Cartesian product; it doesn't modify the way that your output behaves. In most cases, this operation is applied after query processing. So just before you return, oh, here is the set of results, now let's include the projects, whatever, and get the data from the related documents, etcetera. That matters a lot, because it means that you don't have to deal with all of the related complexities.
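A small sketch of what include looks like with the RavenDB C# client: the referenced product documents come back in the same response as the orders, and the later Load calls are served from the session without another round trip. The class shapes here are made up for illustration, and the exact client method names should be treated as approximate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;

public class Order
{
    public string Id { get; set; }
    public List<OrderLine> Lines { get; set; }
}

public class OrderLine { public string ProductId { get; set; } public int Quantity { get; set; } }
public class Product { public string Id { get; set; } public string Name { get; set; } }

public static class IncludeExample
{
    public static void Run(IDocumentStore store)
    {
        using var session = store.OpenSession();

        // One request: the orders plus the product documents they reference.
        var orders = session.Query<Order>()
            .Include(o => o.Lines.Select(l => l.ProductId))
            .Take(10)
            .ToList();

        foreach (var order in orders)
        foreach (var line in order.Lines)
        {
            // Served from the session cache; no additional query is sent.
            var product = session.Load<Product>(line.ProductId);
            Console.WriteLine($"{product.Name} x {line.Quantity}");
        }
    }
}
```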
[00:38:36] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source to target replication. Leverage Datafold's fast cross database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today.
The latest wrench to get thrown into the works of the database market is, I would say, and most people would agree, the advent of generative AI, which has caused virtually every database vendor to say, oh, I now support vectors. And I'm curious what your sense of that is as somebody who's building a database engine. How important is it for the vector to be a native data type and to support the various algorithms that need to be able to operate on those vectors, and just how you see the changing landscape of database adoption, the number of different databases that are out there, and the ways that people are trying to leverage those database engines for new and evolving use cases, and how you think about that pushing the database market going forward?
[00:40:05] Unknown:
I'm going to go out on a limb a little bit here. If we were having this conversation 2 years ago, I'm assuming that you would be asking me about blockchains and the ability to use quantum proof encryption or something like that. And I thought then, and I think now, that blockchain is an interesting way of maybe managing money, but it's a really poor database. Vector search, when you think about it from a technical perspective, is a really nice statement, but let's talk about what it actually means. It means that I have some text, I hand that to an AI model, and it gives me an embedding, a vector of positions.
And then I need to be able to search for something that is near it across all of the dimensions that I have here. So the algorithm is KNN, K nearest neighbors, or something like that. So from a computational perspective, this is a fairly obvious thing. What's the problem? The problem here is that this is really niche, for really specific scenarios. Under what cases do we actually want to do a vector search? When, oh, I want to load the related products for what I have in my shopping cart, so a recommendation engine, for example. That doesn't happen much.
And most OLTP systems don't actually need it. It is an amazing feature in terms of, oh, I have generative AI, I get to tick the checkbox on this, but I don't like that. This doesn't actually provide a lot of value for the user. Another aspect is the inscrutability of AI, which means that if you feed almost identical inputs into the system, you're going to potentially get very different outputs, and I find that this makes me quite nervous in terms of predictability, performance, and those sorts of things. And finally, what's the actual use case? I mean, I'm trying to think about, oh, I want to build a chatbot.
Okay, that's great. How is that related to what you want? Oh, I need to be able to summarize findings based on what the user is asking. Well, here is how you do that without needing any fancy vector search. Take the user question, ask the AI to give me the concrete keywords for what I should be searching on, and then query over that. That gives you a highly predictable, well tuned system where you can say, here is the one black box, and everything else is visible to me. And then you take the output of that and say, okay, now I understand.
Now, let's say I'm searching: give me all of the drones that can connect to Android 7, whatever, and have over a 150 payload capacity. Okay, great. I can take that, ask the model to give me the keywords, which are maybe drone, whatever model is required, etcetera. And then I search for that in the products, and then I can feed the products back to present to the user as necessary. And this massively reduces the amount of magic that you have in your life.
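A hedged sketch of that flow in C#: the model call sits behind a made-up IKeywordExtractor interface (any LLM client could implement it), and the product lookup is a plain full text query rather than a vector search. Everything here, including the class shapes and method names, is illustrative only.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;

public class Product
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}

// Hypothetical wrapper around whatever AI model extracts keywords from a question.
public interface IKeywordExtractor
{
    Task<string[]> ExtractKeywordsAsync(string userQuestion);
}

public static class KeywordSearchFlow
{
    public static async Task<List<Product>> FindProductsAsync(
        IDocumentStore store, IKeywordExtractor extractor, string userQuestion)
    {
        // Step 1: the only opaque step; ask the model for concrete search terms.
        string[] keywords = await extractor.ExtractKeywordsAsync(userQuestion);

        // Step 2: an ordinary, predictable full text query over the product catalog.
        using var session = store.OpenAsyncSession();
        var results = await session.Query<Product>()
            .Search(p => p.Description, string.Join(" ", keywords))
            .Take(10)
            .ToListAsync();

        // Step 3: hand the matching products back (optionally to the model again for phrasing).
        return results;
    }
}
```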
[00:44:10] Unknown:
I appreciate that answer, because it's definitely interesting to get the perspectives of different people on these various hype cycles, and how much of it is actually a lasting question that needs to be solved for, and how much of it is just a point in time that has gotten people all excited and will eventually die down because there isn't any fundamental shift in the way that the world works. Now, looking to the process of actually designing and building a non relational database engine, as somebody who is doing it, I'm curious what you see as the first steps on that journey, and some of the ways that you think about the initial set of questions that you're trying to answer that will get you to your eventual goal, and how many of those questions you can actually answer as, this is the primary functionality that I want, before they start to conflict with each other?
[00:45:04] Unknown:
It's funny. If you look at database engines, at the bottom most tier you have something called a storage engine. This is how the database actually reads and writes data from the disk. And I find it funny, because there is a very popular storage engine called RocksDB, and RocksDB has been used as a storage engine for MySQL, there is something called MyRocks, as well as the storage engine for many non relational databases. And at the end of it, all databases have to deal with physical realities. I need to be able to search something, so the underlying data structure has to be sorted in some way, and all I have to offer you is an implementation on top of that. Now, if we look at the actual model, there is the network protocol. I mentioned the chatty versus batch transaction models.
Am I talking about individual values, and am I modifying them in bulk or not? All of those are really important questions that you have to think about. A good example: do I submit individual queries, or do I submit a whole set of them for the engine to operate on? There is a tremendous number of trade offs that you have to take into account in how you operate on that. I think that when you start thinking about the database design, you should sit down and write a sample application. Not just writing a sample application, but writing the use cases: here is the set of endpoints that I have, and here are the interactions. So go and sit and write the SQL that you would have there. Here's the data that I write, here's the data that I read, here's how I do some things. And that tells you a lot about what you want to do.
For example, am I interested in running aggregation queries? Well, if I'm using Cassandra, then the answer is no, because there is no way to actually do that. If I care about running only aggregations, then a column family database is a much better choice. If I'm handling a streaming data source, then I care about my state and a very rapid ability to load a particular state, modify it, and save it back. Then we have to figure out, I don't know if calling it the secret sauce is the right term, but what is my strength, what is the thing that people come to me for? This is what I need it for, and then everything else comes from that. For Raven, my thought was, I want to be boring. I want you to get the data and go do your own thing. And everything revolves around that. I have to be boring, so operations need to be simple.
I mentioned the count and paging earlier, but that also meant that my modeling had to be simple. So I went with documents, and you define the model in your code, and it's very easy to change and update over time. If I cared a lot about graphs, the question would be how you model the tuples and traversals or something like that. So the question is, what's my purpose? What am I trying to achieve? The interesting thing here is that, again, I've been doing this for 15 years, and the initial thing that you start with has a huge impact on your overall design. But over time, you realize that, hey, there are other things that I can also approach and do.
So RavenDB has support for managing time series directly inside of the database. Almost accidentally, we realized that, you know, we have a document model and we allow you to write to multiple nodes in the cluster at the same time. So it's not a single primary; every node can accept writes all the time. Oh, what happens if I put those nodes in drastically different geographical locations? Now I don't have just a cluster, I have a geo distributed one. And some of that just starts falling out of those early design decisions. And there is a very strong need to make sure that this is a generally cohesive end result, versus, you know, I can do this and this and this, but there is nothing that pulls it together. One feature that we ended up implementing was the idea of allowing background processing of a query, as I'll sketch below. So we can write a query and subscribe to its results as a persistent operation.
And the example I have to give is, oh, whenever an order is set to paid, there is some other process that handles the shipping of it. So I have a query such as from Orders where Status is paid, and I have workers that are subscribing to it, and they accept that, and they are all about being able to reliably process it. This feature actually came about from implementing replication: I know where my replication status is on each node, and I was able to leverage that to implement persistent queries on the other side. So there is a lot of impact from those details. If we didn't have the cursor position for replication, we couldn't have used that same mechanism for a persistent query cursor.
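This maps to what RavenDB calls data subscriptions in its C# client. Here is a hedged sketch: the subscription name, query, and class shapes are made up, and the exact API signatures should be treated as approximate rather than definitive.

```csharp
using System;
using System.Threading.Tasks;
using Raven.Client.Documents;
using Raven.Client.Documents.Subscriptions;

public class Order { public string Id { get; set; } public string Status { get; set; } }

public static class ShippingWorker
{
    public static async Task RunAsync(IDocumentStore store)
    {
        // Create (once) a persistent query that the server remembers a cursor for.
        string name = await store.Subscriptions.CreateAsync(new SubscriptionCreationOptions
        {
            Name = "orders-to-ship",
            Query = "from Orders where Status = 'Paid'"
        });

        // A worker picks up from wherever the subscription's cursor currently is,
        // including documents that start matching after the worker was started.
        var worker = store.Subscriptions.GetSubscriptionWorker<Order>(name);
        await worker.Run(batch =>
        {
            foreach (var item in batch.Items)
            {
                // Reliable processing: the batch is only acknowledged after this handler returns.
                Console.WriteLine($"Shipping order {item.Id}");
            }
        });
    }
}
```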
[00:51:19] Unknown:
The exploration of the evolution of the requirement set is interesting. And as you said, you added the capability for managing time series, while your core focus was just, how do I make this boring and easy to use? And I'm wondering, as you have added new capabilities, as customers have asked for new things, as the open source community has tried pulling and pushing the engine in different directions, how have the early technical decisions of how to actually implement and architect the engine hamstrung you in those efforts of evolution, and what are some of the unanticipated challenges and pain points that you've run into in the process?
[00:51:57] Unknown:
So the primary issue that we ran into: when I started writing RavenDB for the first time, I was primarily an application developer. RavenDB is written in C#, and I wrote it initially as a pretty standard ASP.NET application, which meant the protocol that it uses is REST based, HTTP calls, etcetera. And that was a very natural way of doing it, and it worked beautifully in the sense that we were able to rapidly get it out, get users, and everything. And at the same time, around 2014 or so, we had an almost continuous series of performance problems for users.
And you end up realizing that the way that we even approached the entire thing was broken, because we were spending 85 to 95% of our time in garbage collection. At that time, I remember, it was like fighting fires: you spray some water there, and before you manage to quiet that down, you have to run over to another location and spray some more water there. We started having users with, you know, 100 gigabyte datasets, and suddenly running a GC on that sort of data set is just, okay, that's expensive. And we generated a lot of garbage per operation, because we never actually policed our allocations. We just wanted to get the software out the door. So in 2015, we actually took a step back and did something that I'm really, really proud of. We sat down and did a post mortem of the entire system.
And we said, here are all of the things that we did badly. What would we do differently if we started it today? And then I did something really stupid: we basically rewrote the engine from scratch while still supporting the old version. I'm saying stupid because it took us close to 3 years of running the old version and the new version in parallel, and we had to deal with compatibility issues along the way. It was really, really painful. But the end result was an engine that was more than 10 times faster and much easier to work with and extend. And I want to give you one particular example of the major difference there.
So RavenDB is a document database. It stores and operates over JSON documents. Now, how do we store JSON documents? The naive way of doing that is to just store the JSON text, and then you can operate over that. But then you realize something really, really nasty. JSON is a really poor format if you need to do any sort of processing over it, because, let's say that I want to say, give me the names of all of the employees in London. I have an index on the city that you're in, and I can start executing on that. Perfect. But now I have a list of all of the matching employees, and I need to fetch the name. How do I fetch the name? If I'm using JSON, I need to actually parse the text.
That's expensive. And if I need to get another field, I have to parse another piece of text. So in most cases, it's actually easier to parse the text once into a JSON object in memory and just work with it that way. And because I wanted to avoid parsing JSON over and over again, I introduced a cache into the mix. Sensible, it makes perfect sense, and I had amazing benchmark results because of it. Move a few years later: data sets grew, and the amount of work that we do also grew tremendously. And we started seeing a lot of problems with this.
And the crazy thing here is that when I profiled the system, I wasn't actually seeing a lot of time spent in JSON parsing. I saw a lot of time spent in GC, and I could not understand what was going on. And then I realized that the following set of events happens. We are running near to full capacity, so almost no free memory. We are loading data from the disk, parsing it, throwing it into the cache, and moving on. Now, a JSON document is composed of many different objects; it's basically an object graph. So we deserialize it into an in memory object graph and stick it into the cache.
And the problem then is that the cache gets full while we are running GC. The GC runs and everything in the cache gets pushed into a higher generation. But now I'm about to run out of memory, so the cache responds and evicts something to free memory. This situation basically means that the cache I created has turned itself into an evil mechanism that keeps the cached values in memory just long enough to push them into gen 2. So when we actually run low on memory, we have to do a really expensive gen 2 collection. And one of the things that we did with the new version of RavenDB was to not use JSON internally to store the data. We're using something called Blittable, which is a zero copy format. We search for a page, we use memory mapped files, and we get a pointer to the document.
And then, using just pointer arithmetic, we can get to the exact name field that you want. No need to parse anything. No need to keep anything in managed memory. Everything is already in unmanaged memory, which means that we can lean heavily on the operating system for memory management. It means that repeated queries on the same data benefit from the page cache of the OS and all sorts of other stuff like that, without needing to pay any GC costs. And that was an exquisitely hard decision. I remember it took us, like, 6 months to a year to get to the point that, okay, now I can test it.
And I remember thinking, I know this is going to be bad, we haven't had a chance yet to actually do any major optimization. And I'm comparing one versus the other, and one of them was able to hit, I think, 3,000 queries per second, and this one was able to hit 50,000 requests per second. Like, okay, we really have something here.
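To make the zero copy idea concrete, here is an illustrative C# sketch of reading a single field out of a memory mapped file by offset, without deserializing the whole document. This is not RavenDB's actual Blittable layout; the toy format (a fixed header holding the field's offset and length) is invented purely to show the pointer arithmetic idea.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class ZeroCopyFieldRead
{
    // Invented toy layout: [int nameOffset][int nameLength][...rest of the document bytes...]
    public static string ReadNameField(string path)
    {
        using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        using var accessor = mmf.CreateViewAccessor();

        // Jump straight to the field we need; nothing else in the document is touched or parsed.
        int nameOffset = accessor.ReadInt32(0);
        int nameLength = accessor.ReadInt32(4);

        var buffer = new byte[nameLength];
        accessor.ReadArray(nameOffset, buffer, 0, nameLength);

        // Only this one field is ever materialized as a managed object.
        return Encoding.UTF8.GetString(buffer);
    }
}
```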
[00:59:33] Unknown:
You mentioned that at that point where you were running into those performance issues, you effectively had to rewrite the engine from the ground up with the concept of if I had to do it over today, how would I do it? I'm interested as you stand now, now that you've gone through that pain and out the other side, if you were to start a new database engine today, what would you build?
[00:59:55] Unknown:
Interesting. So an engine today would probably be using the batch transaction model. It would be talking over TCP/IP or QUIC or something like that, with multiple streams. Probably not QUIC, however, because of its reliance on TLS; I want to use something that doesn't involve expiring certificates. There is also a difference between what I think would be a really good database to build from a technical perspective and what would be a good database to build from a commercial perspective. Commercially, it would probably be something that allows me to run on and synchronize data to mobile devices and things like that, just because of the wide applicability and ease of use.
Other aspects: it would almost certainly be non-relational. I don't think there is much value in building another relational database. The distribution model would probably be very different. When I created RavenDB, I was very heavily influenced by the Dynamo paper from Amazon, with the issues of concurrent multiple writes and things like that. But the situation today is very different. It would probably have a lot more reliance on cloud technology. So, for example, the way that RavenDB works, we assume that we have local disks to work with. Today, I would say, no, I would probably design something that is based around S3-compatible storage.
So my idea is, okay, I have some data and I want to operate on it, so I write it to a write-ahead log, and that is the only thing that needs to be maintained on the disk; everything else is then thrown into S3-compatible storage or something like that. And that means that you can have a native approach to how reads and writes work across many systems without having to deal with expensive NVMe instances or something like that.
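A minimal sketch of that shape, assuming boto3 as the client for an S3-compatible service; the bucket name, endpoint, file names, and segment-rolling policy are hypothetical:

```python
# Local write-ahead log for durability, with full segments offloaded to object storage.
import os
import boto3

SEGMENT_LIMIT = 4 * 1024 * 1024                       # roll the log every ~4 MB (arbitrary)
WAL_PATH = "current.wal"

# Any S3-compatible service can be targeted via endpoint_url.
s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")

def append(record: bytes) -> None:
    """Durably append a record to the local write-ahead log."""
    with open(WAL_PATH, "ab") as wal:
        wal.write(len(record).to_bytes(4, "little") + record)
        wal.flush()
        os.fsync(wal.fileno())                        # the only data that must survive locally

def maybe_roll(segment_id: int) -> int:
    """When the log is big enough, ship it to object storage and start a fresh one."""
    if os.path.exists(WAL_PATH) and os.path.getsize(WAL_PATH) >= SEGMENT_LIMIT:
        with open(WAL_PATH, "rb") as wal:
            s3.put_object(Bucket="db-segments",
                          Key=f"segments/{segment_id:010}.wal",
                          Body=wal.read())
        os.remove(WAL_PATH)                           # local disk only holds the active tail
        return segment_id + 1
    return segment_id
```

The appeal of this split is that the local disk only has to hold the active tail of the log, while the bulk of the data lives in cheap, replicated object storage instead of expensive NVMe volumes.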
[01:02:18] Unknown:
As somebody who has been working in the database ecosystem for a number of years now, and who has your own database engine that you build and maintain and work with customers around, what are some of the most interesting or innovative or unexpected ways that you have seen either non-relational engines generally or RavenDB specifically applied?
[01:02:40] Unknown:
People will take you to the edge and then push. I had a user one time who called me and said that RavenDB was very slow for his use case. I'm like, okay, what's going on? And his use case was a product catalog, and he had something on the order of 50,000 items in the catalog. I'm like, that's weird, that's well within our capability. What's going on? And he had put the entire catalog into a single JSON document. And I had to explain: hey, RavenDB is a document database, but you are allowed to use multiple documents. And a 700 megabyte document that you have to load and parse each time is not the ideal format for what you're trying to do. I had another user who put an array inside a document with everything that happened in a particular scenario.
Some scenarios may have hundreds of thousands of operations. And here is the funny thing: they were reaching data sizes in the high tens to low hundreds of megabytes for a document. And that means that every time you touch that document, you have to deal with a lot of data. That's not how you want to model things, but it works. Other systems have put limits in place. MongoDB has a 16 megabyte limit on a document, and on one hand, I'm happy that I don't have that. I think that we have a limit of 2 gigabytes, which is utterly ridiculous; I didn't think anyone would even approach the 100 megabyte mark, but people do. So it's really funny that wherever you have an edge, people will go to it.
But then again, something that we learned very early on: it's better to be slow than to not work at all. Think about the scenario where you have a hard limit. Let's say, I think DynamoDB has a 400 kilobyte limit. That has a huge impact on the way that you design things. Oh, I want to write a blog post; I have a 12 page blog post that I'm writing right now; it does not fit into DynamoDB. Okay, so I literally cannot use this database at all; I have to restructure everything from the get go. But if it's just, oh, you're running in an inefficient manner, okay,
fix it, but it's not critical. I think that's the thing that I took away from this. The other thing is, RavenDB has a way for you to define indexes that don't just index a field but also run a computation over it. And at some point, people hit limits that they didn't know we even had. For example, someone put a computation into RavenDB that would generate an invalid program exception. Why is that? They had more than 65,000 local variables in this computation. I'm like, I don't even have a way of responding to you. You sent me this index definition, and I'm amazed that it even got as far as starting to work.
Don't do that sort of thing. So I think the sorts of things that people will put you to use for are just flat out amazing.
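One practical takeaway from the limits mentioned above (DynamoDB's roughly 400 KB item limit, MongoDB's 16 MB document limit, RavenDB's roughly 2 GB ceiling) is to check document size in application code before writing rather than discovering the edge in production. A small hypothetical guard along those lines, with the figures taken from the conversation rather than from each engine's documentation:

```python
# Check a document's serialized size against the target engine's hard limit
# before attempting the write; the limits below echo the episode, so confirm
# them against the official documentation for your engine and version.
import json

LIMITS = {
    "dynamodb": 400 * 1024,               # ~400 KB per item
    "mongodb": 16 * 1024 * 1024,          # 16 MB per document
    "ravendb": 2 * 1024 * 1024 * 1024,    # ~2 GB ceiling, effectively "don't do this"
}

def check_size(engine: str, doc: dict) -> int:
    """Return the serialized size, or raise if it exceeds the engine's limit."""
    size = len(json.dumps(doc).encode("utf-8"))
    limit = LIMITS[engine]
    if size > limit:
        raise ValueError(f"{size} bytes exceeds the {engine} limit of {limit} bytes; "
                         "split the data across multiple documents instead")
    return size

check_size("mongodb", {"title": "catalog", "items": list(range(1000))})
```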
[01:06:17] Unknown:
And in your experience of building RavenDB and working in this ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:06:30] Unknown:
If it can happen, it will happen. Just to give a sense of scale: while I'm talking to you, I'm working on a system with a 4.1 gigahertz CPU, which means that on every one of the 12 cores that I have available, I can do on the order of 4 billion operations per second. If something is a one-in-a-million event, it is happening right now. We run into every sort of race condition in my code, between my threads, between distributed systems, and it's just really, really strange to see. One of the best investments that we made is generating additional information about how the system is working, what's going on, those sorts of things. Beyond that, correlation is something that will trip you up in ways I cannot even explain.
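A back-of-the-envelope version of the "one in a million is happening right now" point, using round numbers rather than the exact figures from the conversation:

```python
# Rough expected frequency of a "one in a million" event on a busy modern machine.
ops_per_core_per_sec = 4_000_000_000      # ~4 GHz, assuming roughly one operation per cycle
cores = 12
probability = 1 / 1_000_000               # a "one in a million" race window

expected_per_sec = ops_per_core_per_sec * cores * probability
print(f"{expected_per_sec:,.0f} occurrences per second")   # ~48,000 per second
```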
We had a bug that was initially written in 2013. In 2024, we had 4 separate issues within the span of 2 weeks, all of them exposing this bug, from several different customers. I'm like, okay, yeah, I get it, this is a bug, we'll fix it. But how come no one ever found it in all that time, and now 4 different people, unrelated to one another, found it at the same time? That's utterly bizarre. And that's something that we saw happening over and over again. Windows 2016 and earlier have a global lock for the page table.
So, okay, if you do things just right, you can spend 90% of your time in kernel mode. I've been running RavenDB since 2009, so the actual product existed back then, not just the design. And we only discovered that back in 2020, after two separate customers came to us about it at around the same time. How come we never figured this one out before? And then you realize that, oh, you have to have these 6 different things happen all at once in order for this to actually trigger.
[01:09:10] Unknown:
And okay, I understand it. So how come it triggered now? I have no way of explaining that. Yes, the interesting laws of probability, where you see something and you say, oh, that's super rare, I don't have to worry about it. But then you start to operate at scale and that rarity happens all the time, similar to the issues that Google and Amazon run into with hard drive failures, where if you run enough hard disks for long enough, you're going to be seeing failures all the time. For people who are looking at selecting a database engine and trying to decide what exactly they need to use to solve their particular use case, what are the situations where a non-relational engine generally, or RavenDB specifically, is the wrong choice?
[01:10:03] Unknown:
If you're running a reporting database, most non-relational systems are probably not a good idea. The same goes if you have a relatively small amount of data and huge variance in the way that you query it: SQL is a really wonderful language, and the sorts of things that you can do with a relational engine can be quite amazing. But the funny thing is, as your system grows, the variety of operations that any one database handles shrinks, because you would have the transactional database and the OLAP database and the reporting database, and here it goes to Kafka to do stream processing, etcetera, etcetera. So try to select something that can cover your scenarios.
At the same time, there used to be the idea that, oh, everything needs to be a separate microservice, and now you have what is called a distributed monolith. Trying to fit everything into a single location, especially if it's a bad fit, like complex object models inside a relational database where you have to stitch them together all the time using joins, is a bad idea. But in the same sense, so is trying to get a non-relational database to do relations, and this is really, really common. One of the things that we struggle with is users who build their models as if everything were still in a relational table, and then they can't figure out how to code themselves out of this situation.
[01:11:46] Unknown:
And as you continue to build and iterate on and evolve the RavenDB engine, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to explore?
[01:12:00] Unknown:
So, I consider RavenDB now to be a really mature database. We have all of the features that we want to have in the database. What we are looking at now is the ecosystem: okay, let me show you how we can integrate with existing frameworks in order to make your life easier, how I can plug into additional elements in your environment. So, using Kafka: here is how RavenDB can pull and push data from Kafka natively, without you needing to write your own code. Here is how you can distribute your data and start using serverless more easily with RavenDB, those sorts of things, in the sense of, okay, you threw the data in, what do you do with it? And we want to ensure that we have a good answer for whatever you want to do there.
[01:12:57] Unknown:
Are there any other aspects of this space of non-relational databases, the design and implementation of them, and their applications for end users that we didn't discuss yet that you'd like to cover before we close out the show?
[01:13:11] Unknown:
I don't think so. I would like to recommend Designing Data-Intensive Applications, which is a great book. And if you really care about databases, about how they work from the inside out, there is another book called Database Internals. Neither of them is my book, by the way. It is a really deep dive into how databases are built, inside and out, both for local databases and distributed ones.
[01:13:38] Unknown:
Yeah, I'll second the recommendation of Designing Data-Intensive Applications. That's a very well written book; I learned a lot from it. I will have to take a look at the Database Internals book. And for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. As the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:14:09] Unknown:
Where is your data? Where is it located, and where is it going? I think that for most enterprises, just understanding that is a huge challenge. If I have a customer coming into my system, where is that data located? Which database is it in? And especially when you bring in data governance, GDPR, those sorts of things, this is a hugely complicated task to deal with. Especially as your system grows, just tracing what's going on can be an incredibly challenging task.
[01:14:51] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you have done on RavenDB, your expertise in the ecosystem, and your insights into non-relational database engines and their design. I appreciate all of the time and energy you've put into helping to move this space forward, and I hope you enjoy the rest of your day. Thank you very much. Thank you for having me.
[01:15:20] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Oren Eini
Accidental Entry into Data Space
Frustrations with Early Database Work
Early NoSQL and CouchDB
Defining Non-Relational Databases
Complexity in Data Modeling
User Requirements for Non-Relational Engines
Choosing Between Relational and Non-Relational Databases
Impact of Generative AI on Databases
Designing a Non-Relational Database Engine
Challenges in Database Evolution
Innovative Uses of Non-Relational Databases
Lessons Learned in Database Development
Future Plans for RavenDB
Final Thoughts and Recommendations