Summary
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a database that supports document, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steemann and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steemann about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you give a high level description of what ArangoDB is and the motivation for creating it?
- What is the story behind the name?
- How is ArangoDB constructed?
- How does the underlying engine store the data to allow for the different ways of viewing it?
- What are some of the benefits of multi-model data storage?
- When does it become problematic?
- For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?
- How does it compare to OrientDB?
- What are the options for scaling a running system?
- What are the limitations in terms of network architecture or data volumes?
- One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?
- What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?
- What are some of the most interesting or surprising uses of this functionality that you have seen?
- What are some of the most challenging technical and business aspects of building and promoting ArangoDB?
- What do you have planned for the future of ArangoDB?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- ArangoDB
- Köln
- Multi-model Database
- Graph Algorithms
- Apache 2
- C++
- ArangoDB Foxx
- Raft Protocol
- Target Partners
- RocksDB
- AQL (ArangoDB Query Language)
- OrientDB
- PostgreSQL
- OrientDB Studio
- Google Spanner
- 3-Tier Architecture
- Thomson Reuters
- ArangoSearch
- Dell EMC
- Google S2 Index
- ArangoDB Geographic Functionality
- JSON Schema
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Jan Stücke and Jan Steemann about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.
So, Jan one, could you start by introducing yourself?
[00:00:57] Unknown:
Yes, hello. My name is Jan Steemann. I'm a developer here at ArangoDB in Cologne, and I'm working on the core functionality of ArangoDB, mainly the query language, the cluster functionality, and lots of the internals. I focus on these kinds of things and also on performance optimizations of them.
[00:01:17] Unknown:
And Jan two, how about yourself? Yeah, hi. I'm Jan Stücke. I'm the head of communications here at ArangoDB. I joined nearly 3 years ago, and I'm mostly concerned with communicating ArangoDB, creating technical documentation and training material, and helping the community on Slack, for example.
[00:01:38] Unknown:
And going back to you, Jan one, do you remember how you first got involved in the area of data management?
[00:01:43] Unknown:
Yes. So it's a while ago. I've been using SQL databases for at least 15 years, I'd say. I wrote a lot of SQL for other databases and was always involved in how to make SQL queries perform well and all this. Eventually I got into performance optimization of SQL, looked into the source code of other databases, and tried to hack around there. And when I got the offer to join ArangoDB, it was actually not named ArangoDB at that point in time, but it was already a database company, and I had always wanted to work on a database of my own. So I took the offer,
[00:02:18] Unknown:
and I'm pretty happy I did it. And Jan two again, how about yourself? So I'm coming more from the product and communication side. My first job in the tech industry was, I think, 9 years ago; back then I designed software processes for an e-commerce platform and then developed more and more in the direction of project management and product management. I then worked at another company called Next Level, which is an IT recruiting company, where I was mostly responsible for voice and tone and all this communication stuff. And then, as I said, 3 years ago I joined ArangoDB in the communication and marketing team, where I mostly focus on all the technical communication stuff.
[00:03:08] Unknown:
And I gave a high-level description of what ArangoDB is, but I'm wondering if you can give a bit more detail about the product and what the original motivation for creating it was.
[00:03:19] Unknown:
Sure. So in a nutshell, ArangoDB is a distributed native multi-model database, and we support, as you already mentioned, key/value, documents, and graphs. The database also supports join operations and multiple graph algorithms like shortest path or pattern matching, but also distributed graph processing with Pregel. We're open source under the Apache 2 license, written in C++, and ArangoDB also supports ACID transactions. It's what is called a schemaless database, but if you want to validate a schema, you can use our Foxx framework for data-centric microservices. Foxx is basically a JavaScript framework based on Google's V8 engine, which you might know from your Chrome browser, and what it does is allow you to bring data-intensive business logic closer to the data: you can run queries directly on the server, and any kind of business logic that you like. We also have, of course, cluster management, which is based on the Raft consensus protocol that we integrated into our Agency, basically the brain of our cluster. The roots of ArangoDB go back to 2012.
So the project was started there, and the company was then founded in 2015, when we also received our initial funding. In 2017 we were pretty happy to announce that we could close the seed round, led by Target Partners, a tech investor here in Germany, and received total funding of around $7 million. The idea and motivation of our two founders, Frank and Claudius, has its roots in their previous business. Both are pretty experienced in building databases, for over 20 years now, and prior to starting the ArangoDB project they already had a successful database development and consulting company called triAGENS. They did projects for the New York Stock Exchange, Deutsche Bank, you name it. But what they learned during this time is: yes, you should use the right data model for the right job, but why do I have to learn a completely new technology for doing so? Frank and Claudius then found a way to combine these three data models, and that was basically the moment that gave birth to the ArangoDB project.
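To make the multi-model idea concrete, here is a minimal sketch of the kind of query AQL supports, as it could be run from the arangosh JavaScript shell. The graph name "social" and the person documents are hypothetical, not from the episode:

```js
// Hypothetical named graph "social" with a vertex collection "persons".
// SHORTEST_PATH is one of the built-in AQL graph operations mentioned above.
db._query(`
  FOR v IN OUTBOUND SHORTEST_PATH 'persons/alice' TO 'persons/bob'
    GRAPH 'social'
    RETURN v.name
`).toArray();
```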
[00:05:38] Unknown:
And you mentioned that when it was first created, it didn't bear the current name. So I'm wondering if you can share where the name ArangoDB came from and what the motivation was for the avocado as the mascot.
[00:05:52] Unknown:
Yeah, we get this question a lot. At first we wanted to call ourselves AvocadoDB, because there are, I think, around 1,000 varieties of avocados on this planet, and what we like is that they all look similar from the outside, but if you take a closer look, they are quite different. That reminded us a lot of how people see or might perceive data. But there was another company with, from my understanding, not at all a similar name, and they threatened to sue us if we kept it. So we had a look at those 1,000 varieties of avocados and chose ArangoDB, because we thought it sounded fast. And today, I think we are all pretty happy about choosing ArangoDB as our name. And so
[00:06:39] Unknown:
given the fact that it's multi model and it's all running on the same engine, there must be a single way of representing the data at the storage layer. So I'm wondering if you can discuss how the database is constructed for being able to support all of these different views on top of the underlying storage layer.
[00:06:58] Unknown:
We have storage engines in the background; ArangoDB provides two different storage engines. Both have totally different sweet spots, and they share a common interface. The storage engines handle the very low-level operations for retrieving and storing the data, like getting data by primary key, storing data by primary key, or doing an index scan over a specific range, something like this. So at the storage level, it can be quite different how these storage engines actually implement data retrieval and data storage. But on the layers above that, we have our query language and all the higher-level APIs that users interact with, and they provide different views on the same data that is stored by the storage engines.
So it basically allows getting the data from the storage engines and using the same data in document-style queries, relational table-style queries, or graph queries, bringing all of this together to make use of the same data hosted by the storage engine and to analyze or view it in different fashions.
[00:08:13] Unknown:
And when you're storing the data, do you need to determine at write time which interface you're going to use to view it? So do you need to say this record is being stored as a document, this record is being stored as a node in a graph? Or is it something that can be dynamically modified, where you maybe store it as key/value and then later construct a graph between nodes and query it via different semantics at read time? Yes, and I think that's one of the strengths of ArangoDB. So, when you store the data,
[00:08:47] Unknown:
you don't have to know upfront that you will use the data in a graph later on. You can basically just go ahead, store your data, and later bring it together by, more or less, running a graph query on it. As I said, this is one of the advantages: you have your data that you have stored, but you can analyze it in different ways later on, and you don't have to know all the ways upfront.
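A minimal arangosh sketch of what Jan describes, with made-up collection and document names: the documents are stored first, and the edge collection and traversal are added only later, without any upfront modeling decision:

```js
// Store plain documents first; no graph model declared upfront.
db._create("persons");
db.persons.save({ _key: "alice", name: "Alice" });
db.persons.save({ _key: "bob", name: "Bob" });

// Later: add relationships as edges and query the same data as a graph.
db._createEdgeCollection("knows");
db.knows.save({ _from: "persons/alice", _to: "persons/bob" });

// Document view and graph view of the very same data:
db._query(`FOR p IN persons FILTER p.name == 'Alice' RETURN p`).toArray();
db._query(`FOR v IN 1..2 OUTBOUND 'persons/alice' knows RETURN v.name`).toArray();
```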
[00:09:09] Unknown:
And with different ways of querying the information, I imagine that there are different performance characteristics. So are there ways that you can tune the storage layer or tune the way that you are writing or indexing the data to give better performance based on certain query patterns?
[00:09:29] Unknown:
Yes. So all the data can obviously be indexed in different ways. We provide several different index types, and you can choose the right index for the work at hand, basically. The storage engines also come with different sweet spots: one of the storage engines keeps most of the data in main memory, so it's pretty fast if your data set fits in RAM, and the other storage engine is based on RocksDB, which allows handling much greater volumes of data. They provide a lot of tuning parameters, and then there's the indexing that can be done. Our query language comes with, I'd say, a quite complex query optimizer that tries to optimize the queries run against the data. It also distributes the data properly in the cluster, and the individual query parts are automatically distributed across the nodes of the cluster.
So there's a lot going on in the background. And if there's manual tuning required, there are a lot of configuration parameters that users can still choose to modify.
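As a rough illustration of that tuning surface, a sketch with hypothetical collection and field names; the index type names follow the ArangoDB 3.x conventions:

```js
// Pick index types to match the access pattern.
db.persons.ensureIndex({ type: "hash", fields: ["email"], unique: true }); // exact lookups
db.persons.ensureIndex({ type: "skiplist", fields: ["age"] });             // range scans

// Ask the optimizer how it would execute a query.
db._explain(`FOR p IN persons FILTER p.age > 30 RETURN p`);
```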
[00:10:34] Unknown:
And what are some of the benefits that a multimodel data engine provides versus something that will only support relational or document or graph or key value storage?
[00:10:46] Unknown:
Yeah, one of the main benefits is that it doesn't force you into a specific direction. For example, if you migrate from an SQL-based database, where you have the relational model with tables and all the rows in a table having the same structure, you can migrate this model easily to ArangoDB, because in ArangoDB we have collections, and a collection can be seen as a table with all the documents in the collection having the same structure. But if you then find that you want more flexibility, you can start making modifications: you can, for example, give certain documents a different structure. You can later bring all the data in the tables or collections together by adding indexes and joining them, as in the sketch below. You can also add parts of them, or all of them, to a graph and later use them in graph queries. So you're not forced into a specific mode or model from the beginning; you can do this more or less as you see fit and adjust to your needs.
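A small sketch of that migration path, with hypothetical "orders" and "customers" collections standing in for former tables; the nested FOR with a FILTER is the AQL equivalent of a relational join:

```js
db._query(`
  FOR o IN orders
    FOR c IN customers
      FILTER o.customerId == c._key
      RETURN { order: o.total, customer: c.name }
`).toArray();
```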
[00:11:44] Unknown:
And are there any cases where having the multi-model view of the data becomes problematic or confusing for users? I think that's a very good point. So, internally,
[00:11:58] Unknown:
we face the same challenges that other distributed databases face. Supporting multiple feature sets for the different models is, of course, a bit more work compared to a single-model database: a bit more work on the query engine, maintaining AQL, building new functions there, and all that stuff. But I don't think any of this is inherently problematic for multi-model itself. For users, that's a very good point. I think it's cool to have a database that allows you to do a lot of things, but what our users often face is that now that they have the freedom to choose, okay, I can use a join or I can use a graph traversal or whatever access pattern, the question is: which one should I use for my application or for part of my application? We try to help our community a lot with community support on Stack Overflow and on Slack, and we also provide as much training content as we can to make it clear when to use which access pattern. A general rule of thumb that is good for initial guidance: if you know the depth of your search before you search, then a join tends to be more efficient compared to a graph traversal. But a graph traversal is perfect if you don't know the depth of your search, or if you're searching across a range or variety of depths.
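That rule of thumb, sketched with the hypothetical "persons" and "knows" collections from the earlier example: when the depth is known, a join over the edge collection works; when it is not, a variable-depth traversal fits naturally:

```js
// Known depth (exactly one hop): a join does the job.
db._query(`
  FOR e IN knows
    FILTER e._from == 'persons/alice'
    FOR p IN persons
      FILTER p._id == e._to
      RETURN p.name
`).toArray();

// Unknown depth (anywhere from one to five hops): a traversal.
db._query(`
  FOR v IN 1..5 OUTBOUND 'persons/alice' knows
    RETURN DISTINCT v.name
`).toArray();
```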
Graphs are not a super new concept, but they gained a lot of popularity over the past years, and the field is still young. So getting people comfortable and used to working with graphs is currently our task: educating our users and helping them get used to using
[00:14:02] Unknown:
the different access patterns. Are there different approaches that are necessary when doing your data modeling versus a relational engine?
[00:14:12] Unknown:
So, basically, you can model the data the same as in a relational engine in the very beginning, and this also allows, as I said, migrating from a relational database to ArangoDB quite easily, because you can have your tables as collections and then use joins to bring the data from these tables together, provided that you've set up all the indexes as you would in a relational database, and so on. This would all work the same. But once you've done that, you can think about making extensions or adjustments that wouldn't be possible with a relational engine, like really making use of the document model: you have the flexibility of having different attributes per document, so you can have variable attributes if you like. You can also utilize the power of graph traversals. For example, you can put relationships between different things, model them, treat your data more or less as a graph, and do traversals, like starting at a specific document and finding all the things it's connected to.
One of the basic examples would be a social network where you have a person and you want to find the connections of that person, not only the direct ones but also the indirect ones, which is one example that's always used. But there are other examples where you can make real use of the graph approach: modeling a network, or components in a network, is a perfect example of modeling data as a graph. And with ArangoDB, you can just do this. You have all your data in the individual collections, and later on, as you see fit, you can bring them together by defining a graph that consists of these collections and the relationships between them, and run queries on them. And one thing: users coming from a relational engine or a relational database will be accustomed to using SQL.
ArangoDB doesn't have SQL, so there is something else that they need to use. We have a query language that basically allows all the data retrieval and modification that SQL does, but it's not SQL. It's AQL, the ArangoDB Query Language. It's similar in concepts to SQL, but it has a slightly different syntax. So this is basically the main thing that users need to keep in mind if they're coming from a relational engine.
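For readers coming from SQL, a tiny hypothetical side-by-side of the syntax difference Jan mentions:

```js
// SQL:  SELECT name FROM persons WHERE age > 30 ORDER BY name;
// AQL, same concepts with different syntax:
db._query(`
  FOR p IN persons
    FILTER p.age > 30
    SORT p.name
    RETURN p.name
`).toArray();
```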
[00:16:36] Unknown:
And there are some other multi-model databases on the market, most notably OrientDB. And in some capacity, you could even think of Postgres as a multi-model database. So how do you view the position of ArangoDB as compared to these other databases that are available?
[00:16:47] Unknown:
Well, in terms of OrientDB, the basic difference between the two technologies is that ArangoDB is C++ based and OrientDB is written entirely in Java. OrientDB offers an SQL-ish dialect to retrieve data, and ArangoDB, as Jan already said, uses AQL to combine all the supported data models. As far as I know, OrientDB doesn't have something similar to the Foxx framework. What I personally like about OrientDB is OrientDB Studio, which is their web UI: you have a lot of nice visualization options, for example for graphs. There we have to catch up at the moment. Functionality-wise we have a good web UI, which can also visualize graphs and has some layout options, but I think from the beauty side of things we can improve. For more details, I would not feel comfortable commenting further on OrientDB because I'm just not an expert, but everyone can Google for feedback, experiences, or the specs of both technologies.
But in general, we have to see what happens to OrientDB now that they got bought by CallidusCloud and CallidusCloud got bought by SAP; I just don't know what will happen to the project. In terms of Postgres: Postgres is a great database, period. They've been around for, I think, over 20 years, they built in a lot of great functionality, and from what I can read on the internet, they're also fast and stable. I can't say anything at the moment about their ability to do graph algorithms like pattern matching or shortest path. We tried to include Postgres and those queries in our open-source performance benchmark, just to get a feeling for ourselves, but we just didn't manage to get this done. We already reached out to our community, to people who have experience with Postgres.
But as far as we can tell, no one could manage to do those graph queries efficiently in Postgres. So I think this is a difference, at least in our performance benchmark. Everyone knows vendor performance benchmarks might not always be fair or could be biased; we open-sourced it completely, so everyone can test it for themselves. And I think with the tabular model and aggregations, Postgres is without competition: they are super efficient and very good. But when you want to use the schemaless JSONB option, then it does not work as efficiently, as far as we could measure in our testing. Nonetheless, Postgres is a great database, and we have a lot of respect for the team and the community behind it. And one of the
[00:19:42] Unknown:
core features of Arango is the ability to scale across multiple nodes. So I'm curious how that functions in terms of the ways that you're able to do things like transactional queries across multiple instances and some of the difficulties that you've been facing as far as offering those scaling options?
[00:20:04] Unknown:
Yeah, let me first start with the modes: ArangoDB can be run in different modes. The most basic mode is the single-server variant, which you obviously cannot scale, but ArangoDB can be run as a single-server instance too. You can add replication to that to achieve high availability and to have a failover node in case your master node goes down, and we also provide a mode to handle all these failover things automatically. But then there's also the cluster mode, which is designed from the ground up to be run on multiple nodes. It's more or less a shared-nothing approach: all your data is stored on specific servers, and obviously there's also some sort of replication going on, so in case a node goes down you still have a backup of it, so it's resilient.
And yeah, there are definitely challenges in developing a cluster database. Not only do you have to store all the data at different positions in the cluster, so it's highly available and you have a backup, but if you have a complex query language like we do, with joins and all this, it's really quite a complex matter internally, especially when it comes to transactions and consistency. It would go beyond the scope of this podcast, I think, if we discussed all the details, because there are really many things to consider there. We are trying hard to optimize this as much as possible and to make queries as efficient as possible in a distributed database, which is probably a never-ending endeavor.
Currently, we are putting a lot of work into our query optimizer so it can dispatch things to just the individual shards that actually host the data, move the operations as close as possible to the actual data in the cluster, and parallelize all of that. This is really a big challenge, but it's actually nice; it's a lot of fun from a developer perspective.
[00:22:00] Unknown:
Yeah, maybe just to add, and to be open and clear about it: we support full ACID transactions on a single instance and also in the automated-failover setup. In a cluster, we support single-document and single-collection transactions. So we have no distributed transactions at the moment, but we're at least thinking about whether we want to support this as well.
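A minimal sketch of such a single-collection transaction on a single instance, using arangosh's transaction API; the collection and document names are made up, and the "@arangodb" module path follows the 3.x convention:

```js
db._executeTransaction({
  collections: { write: ["accounts"] },  // declare written collections up front
  action: function () {
    var db = require("@arangodb").db;
    // Both updates commit or roll back together.
    db.accounts.update("alice", { balance: 50 });
    db.accounts.update("bob", { balance: 150 });
  }
});
```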
[00:22:28] Unknown:
And in terms of the network architecture that you're able to support for the highly available mode or for the clustered mode, what are those limitations, such as being able to distribute across multiple data centers or in terms of the volumes of data that you can manage in the clustered option?
[00:22:50] Unknown:
Yeah. So the cluster version of ArangoDB is basically designed for the use case where all nodes of the cluster are relatively close together, meaning in the same rack or at least in the same data center. It's not really designed for planet-scale installations where you have your nodes in different data centers across the entire world, because that would require wide-area replication of the data. Our sweet spot for the cluster is when all your nodes are close together. If you still want multi-data-center resiliency, so that if one data center goes down with all your nodes in it you still have a backup, then we provide a solution for that, but only in our enterprise version. It's called datacenter-to-datacenter replication: it replicates the data from one entire cluster to another, which can also be geographically far apart. So that is supported, but a single cluster installation is designed to run on nodes that are close together.
In terms of data volume, the cluster version is obviously designed for scaling out. You can add more nodes, and these nodes can handle additional incoming requests from client applications and are also used to store more data. We have two types of nodes in our clusters: the coordinators, which more or less do request processing and request dispatching, so adding more of them gives you more CPU power for running queries; and the actual data nodes, which are responsible for storing and retrieving the data, so if your data volume grows, you can add more of them. Basically, these are the things you have to keep in mind when you want to scale out. But it's not like Google Spanner, where you would put a million nodes together and have them work as a single cluster. I think maybe one important thing to add is
[00:24:53] Unknown:
that we don't see the actual storage of the data as the problem, but rather accessing the data efficiently when it is stored on different machines. What we allow in the cluster is to scale with all three data models and to do complex queries at scale. Think about a graph traversal: for example, Tobias knows Jan and Jan knows the other Jan, and you have sharded this three-node path across different machines; if you then want to query exactly this path, you get network latency for the query. The same becomes true for joins at some point. And if you want to provide efficient query execution, you have to find a solution for data locality and store the data so that the query can be processed locally without network latency, or at least reduce the network latency to a minimum. This is a challenge we're currently working on. We already have solutions for that for joins and for graph traversals, but of course it is a rather new development, and there's still some room for improvement.
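One sketch of the data-locality lever just mentioned: when creating a collection in a cluster, the shard count and shard keys can be chosen so that related documents land on the same machine. Names and numbers here are illustrative, not from the episode:

```js
// Run against a cluster coordinator.
db._create("orders", {
  numberOfShards: 9,
  shardKeys: ["customerId"]  // same customer => same shard => local joins
});
```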
But I think, taking a look at the current market, we are already pretty far.
[00:26:22] Unknown:
And one of the other features that is fairly unique to ArangoDB is, as you mentioned, the Foxx framework for being able to embed microservices into the database. So I'm curious what benefits that provides over the traditional three-tier architecture of having a web server in front of the database for running those transactions?
[00:26:44] Unknown:
So in the classical three-tier setup, your web server executes a lot of queries against the database. If all the business logic is executed on the web server, it will typically run a lot of queries for retrieving and manipulating data in the database, and that normally means a lot of round trips between the web server and the database server, which can really add up if your business logic is complex. With the Foxx framework, you can more or less avoid that, because your web server can issue a single request to ArangoDB, which is then handled by a custom Foxx REST web service running inside ArangoDB. That service can do anything the application developer wants it to do: it can, for example, query all the data necessary for serving this particular request, do all the data manipulations in one go, and then come back with one aggregated response. So basically it saves a lot of network round trips; it's a performance optimization. And the good thing is that application developers can tailor these custom REST APIs as they see fit. It's really up to you as an application developer what your Foxx REST APIs can and will do, so you can put in all the things you require, like validation of incoming data, access control, and all this. After that, you are already close to the data, so you can run all your queries with relatively high performance and then return an aggregated result to the caller. In terms of performance, it will be superior to the classical three-tier approach, but obviously it merges a lot of layers together into one, which several people don't like. I'm not saying this is good or bad; that's the trade-off you have to make. You get a performance benefit, but then you have more code running in the database.
[00:28:52] Unknown:
And how does it compare to things like stored procedures or functions in something like Postgres?
[00:28:54] Unknown:
Yeah, basically it can be viewed as the same thing. Whether you prepare a transaction or prepare a function and then simply execute it later on, Foxx is actually the same: you deploy some custom REST API to the server, and then you just call it and it gets executed. But what Foxx can do is a lot more than just executing queries. You have full access to JavaScript; your Foxx APIs are written in JavaScript, so you have the power of a scripting language to implement whatever you need. It's not limited to running just queries.
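A minimal Foxx service sketch illustrating the point: one HTTP round trip that validates its input, runs a query close to the data, and returns an aggregated result. Route, collection, and field names are made up; the module paths follow the Foxx conventions of ArangoDB 3.x, and the joi schemas on the route preview the request validation discussed in the next answer:

```js
'use strict';
const createRouter = require('@arangodb/foxx/router');
const joi = require('joi');
const db = require('@arangodb').db;

const router = createRouter();
module.context.use(router);

// GET /friends/:key : one request instead of many client round trips.
router.get('/friends/:key', (req, res) => {
  const names = db._query(`
    FOR v IN 1..2 OUTBOUND CONCAT('persons/', @key) knows
      RETURN DISTINCT v.name
  `, { key: req.pathParams.key }).toArray();
  res.send(names);
})
.pathParam('key', joi.string().required(), 'Key of the start person')
.response(joi.array().items(joi.string()), 'Names of direct and indirect friends');
```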
[00:29:33] Unknown:
And do you have any mechanisms in place to help prevent against security vulnerabilities or data breaches in the code that you're writing within the Foxx framework?
[00:29:44] Unknown:
So, basically, from our point of view, the Foxx code that's written is just arbitrary script code, more or less, and it's up to the application developer not to leak any data that is sensitive, and things like that. But we have something in the framework that makes it relatively easy to do, for example, validation of incoming requests. All the routes that you write for a Foxx REST API have a schema for the incoming data, and any incoming requests that do not conform to the schema are automatically rejected. So you, as an application developer, can easily make sure that you only get valid requests that do not contain any data you don't want to see there. And obviously you can then implement all the things like access control and parameter validation on top of that. There's also access control built into ArangoDB for the collections and the databases.
So there is something that can be done. But apart from that, it is script code that the application developer is writing, so there is still the potential of leaking some data if something is not right. We also recommend not exposing a Foxx web service to the Internet directly, but still having a web server in front of it. That way you keep the classical mechanisms of the web server more or less shielding the database, and you can still put firewalls in between. So we recommend doing this and using Foxx more or less as a performance optimization, so the web server only has to make a single request against the database and not a lot of them.
[00:31:21] Unknown:
And are there any particularly interesting or novel uses of the Foxx framework that you've seen? I think, first of all, it surprised
[00:31:31] Unknown:
me personally how many people are actually using Foxx. I thought that maybe 10 or 20 percent make use of it, but I think it's a safe bet to say that more than half of our community is doing at least something with Foxx. The most interesting use of Foxx that I saw was at a quite huge hospital. What they did was handle access control and permissioning completely in Foxx. They store personal and medical data in the database and, for example, give the doctor access to all the medical information and, of course, the name, while the accountant of the hospital only has access to the name and, for example, the account or address data. So what you can do with Foxx is do this permissioning down to the field level. That was one surprising thing; I hadn't thought about that, but it was pretty cool. When I asked the developer, he said, yeah, well, otherwise I'd have to run a complete identity and access management tool, which has basically the same complexity. Another one that surprised me: I thought that Foxx was more for people with a really sound technical background, but the guys at Thomson Reuters, who also wrote a use case about it, so I can talk openly about it, have business analysts writing quite complex queries and putting them into a Foxx service, thereby taking a lot of work off the developers. So those are the two things that surprised me, where I thought, okay, that's new. And what have been some of the most challenging
[00:33:15] Unknown:
technical and or business aspects of building and growing ArangoDB as a technology and as a business?
[00:33:23] Unknown:
That is a good question. I would say it is true for every database that the technology has to provide substantial value to the user; otherwise, you won't get anywhere as a project. Conveying the value of ArangoDB to developers and also to companies, while there are currently around 400 other databases out there that developers can choose from, is of course not easy. But I think the ease of getting started with ArangoDB, the power of our query language, and, most importantly, the long-term benefits of the multi-model flexibility give us an edge there.
But of course, other teams and other vendors are not sleeping, so this is definitely a challenge on the business side. I think native multi-model is still new to many people, despite being around for 4 to 5 years now. I think we are already past the "does this really work?" question, and more and more people are testing it for their use cases. But nonetheless, multi-model poses the new question we discussed before: if you now have the possibility to choose between different data models and the related access patterns, you really have to decide which is the best for your application or for your use case.
On the technical side, I would say that our current challenge is to provide in the cluster at least a similar experience to what people have when using ArangoDB on a single instance, which of course leads to challenges in executing complex queries efficiently. With SmartGraphs and satellite collections we already have, as I said, pretty good answers, but I think there are still some sleepless nights in front of us. Another current challenge is that we're working on integrating ArangoSearch into ArangoDB; ArangoSearch will be part of ArangoDB 3.4, which will be our next release.
Maybe to describe ArangoSearch just real quick: ArangoSearch is a full-text search and ranking project based on IResearch, which was developed by a team at Dell EMC who already used ArangoDB three years ago and based IResearch on it. It was open-sourced by Dell EMC, and last year we were very happy that the team who actually developed IResearch joined ArangoDB and is now working on integrating it into the upcoming version. The challenge here is that ArangoDB and ArangoSearch are both pretty complex code bases. Marrying them is the first step, and I think we already did a good job there.
But now we also want, of course, to provide this search through the query language and in cluster usage, and this is a whole other story, but I think we're on a good path there. As I said, it's not that easy. Another thing we're working on is managed services. More and more vendors offer managed services for their database, and we think it's the right way to go, something useful that people want. We're already planning in this direction, but it is also nothing trivial, and if you want to do it the right way, it takes some time. But I think over the next couple of months we will also provide something in this direction.
[00:37:00] Unknown:
Are there any other aspects of the project or the business that we didn't discuss yet that you think we should? A few things, I would say. So, as I said, ArangoSearch will be in the next release, ArangoDB version 3.4.
[00:37:13] Unknown:
What we will also integrate there is extended geo functionality. We integrated a new geo index based on the Google S2 index, because we also want to support more geo primitives. At the moment, we support points and also ranges, based on our geo cursor, but for many use cases this is not enough. What people want is polygons, multi-polygons, intersections, multi-line strings, and all that stuff, so this will also be part of 3.4. Some other useful things are, for example, JSON Schema validation at the C++ level, incremental backups, and other useful things that were asked for many times. But of course we're still a young and small team, and it takes a bit of time; we're working on that. Another bigger thing, because it's not that easy to do and to do right, is a relaunch of our web UI. As I mentioned before, we want to provide a better user experience and easier use of the web UI, better graph visualization options, and better teamwork with graphs. But I think the biggest thing over the foreseeable future will be managed services.
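A hedged sketch of the kind of polygon query this new geo work targets, using the GeoJSON-style AQL functions slated for 3.4; collection name, field name, and coordinates are hypothetical:

```js
db._query(`
  LET area = GEO_POLYGON([
    [6.90, 50.90], [7.05, 50.90], [7.05, 51.00], [6.90, 51.00], [6.90, 50.90]
  ])
  FOR shop IN shops
    FILTER GEO_CONTAINS(area, shop.location)
    RETURN shop.name
`).toArray();
```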
[00:38:29] Unknown:
So for anybody who wants to follow the work that you're up to or get in contact, I'll have you add your preferred contact information to the show notes. And as a final question, from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:38:46] Unknown:
Well, I think if you take a look at the database market over the last, let's say, 20 years, then you basically have two directions. Over the nineties, we had databases focused a lot on storage efficiency, but not on scaling out. The other direction, which started around the 2000s, focused a lot on hyperscale while sacrificing a lot of functionality. I personally think there's a stack of technologies missing that can easily scale to, for example, a two-digit machine cluster and work, of course, in a distributed setting, but also provide a rich and deep feature set. This technology has, of course, to integrate smoothly into many development processes. I think this is a huge gap, but also a huge demand for such a kind of technology.
[00:39:39] Unknown:
Well, I appreciate the both of you taking the time out of your day to join me and discuss the work that you're doing with ArangoDB. It's definitely an interesting project and one that I've been following for a while now. So I appreciate that, and I hope you enjoy the rest of your day. Perfect. Thanks a lot for having us. Yeah, thanks a lot, and enjoy your day too. Take care.
Introduction to Guests and ArangoDB
Overview and Features of ArangoDB
Data Storage and Querying in ArangoDB
Benefits and Challenges of Multi-Model Databases
Scaling and Cluster Management
Foxx Framework and Use Cases
Technical and Business Challenges
Future Developments and Closing Remarks