Power Up Your PostgreSQL Analytics With Swarm64 - Episode 133

Summary

The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!

Tidy Data is a monitoring platform for your data pipeline. Custom in-house solutions are costly, laborious, and fragile. Replacing them with Tidy Data’s consistent managed data ops platform will solve these issues. Monitor your data pipeline like you monitor your website. It’s like Pingdom for data. No credit card required to sign up. Go to dataengineeringpodcast.com/tidydata today to get started with their free tier.


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
  • Your host is Tobias Macey and today I’m interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Swarm64 is?
    • How did the business get started and what keeps you motivated?
  • What are some of the common bottlenecks that users of postgres run into?
  • What are the use cases and workloads that gain the most benefit from increased parallelism in the database engine?
  • By increasing the processing throughput of the database, how does that impact disk I/O and what are some options for avoiding bottlenecks in the persistence layer?
  • Can you describe how Swarm64 is implemented?
    • How has the product evolved since you first began working on it?
  • How has the evolution of postgres impacted your product direction?
    • What are some of the notable challenges that you have dealt with as a result of upstream changes in postgres?
  • How has the hardware landscape evolved and how does that affect your prioritization of features and improvements?
  • What are some of the other extensions in the postgres ecosystem that are most commonly used alongside Swarm64?
    • Which extensions conflict with yours and how does that impact potential adoption?
  • In addition to your work to optimize performance of the postgres engine, you also provide support for using an FPGA as a co-processor. What are the benefits that an FPGA provides over and above a CPU or GPU architecture?
    • What are the available options for provisioning hardware in a datacenter or the cloud that has access to an FPGA?
    • Most people are familiar with the relevant attributes for selecting a CPU or GPU, what are the specifications that they should be looking at when selecting an FPGA?
  • For users who are adopting Swarm64, how does it impact the way they should be thinking of their data models?
  • What is involved in migrating an existing database to use Swarm64?
  • What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the product and business of Swarm64?
  • When is Swarm64 the wrong choice?
  • What do you have planned for the future of Swarm64?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:13
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required. Your host is Tobias Macey, and today I'm interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs. So Thomas, can you start by introducing yourself?
Thomas Richter
0:01:30
Yeah. Hi, my name is Thomas. I'm CEO and co-founder of Swarm64, and I'm a strange beast because I live at the intersection of business, data management, and programming. So that's what I do and enjoy very much.
Tobias Macey
0:01:46
And do you remember how you first got involved in the area of data management?
Thomas Richter
0:01:49
So probably the first real exposure to enterprise grade data management and data wrangling was as an intern almost 20 years ago, when I was working at Lufthansa Cargo, the cargo arm of the German national airline. In their cargo department, they did something that today you would probably call data science; back then they called it sales steering. And I basically pulled data out of a large IBM Cognos based data warehouse, with all the beauty of OLAP cubes and the like. So that was my first exposure to that space, and I've since always been at this kind of intersection point, as I mentioned. I very much enjoy basing business decisions on a version of the truth, and I think the most objective version one can obtain is really by looking at the data and, basically, the underlying facts. Then you can make much smarter decisions, because you're looking to prove a hypothesis as opposed to just arguing opinions. So throughout my career I've always been at that kind of cross section, and when I had the opportunity to found something in the data space, I was very excited about it. And so we basically built Swarm64.
Tobias Macey
0:03:01
So can you describe a bit more about what Swarm64 is and some of the work that you're doing there?
Thomas Richter
0:03:08
So Swarm64 is an extension for the hugely popular Postgres database. And I think to your listeners, Postgres will not be a new concept, right? It's very widely adopted and hugely popular. And what we do is we extend into it, and we are basically accelerating it for reporting, analytics, time series, and spatial workloads, and also for hybrid workloads that include transactional components. So that's what we do.
Tobias Macey
0:03:38
And can you give us some of the backstory of how the business got started and what it is about it that keeps you motivated and keeps you continuing to invest your time and energy in it?
Thomas Richter
0:03:49
Yeah, that's a very good question, and it's also quite an interesting journey that we took. So when we started this, and I mean, that sounds horribly stereotypical, but this was actually started in a Berlin coworking space. My co-founder and I met at a co-working space, and we started to go at it initially very much from the hardware angle. My co-founder had developed some of the earliest mobile GPUs, and we were basically looking at data processing from a hardware angle. And as we evolved, we learned from interacting with our customers that everybody wants a full solution: you don't want some kind of piece that you have to puzzle together, you really enjoy having a full solution. And for us, Postgres then came in naturally as a system we could accelerate, and not only with hardware, which was our original take, but also with a stronger and stronger software component that we've built on top. So as I will be explaining later, you now have the choice between software and hardware components that you can add as options. And so yeah, that's how we started. And I think the part that I particularly enjoy about where we've come since we started this is that we're now in a situation where we can really challenge some market players. I'm talking about the big proprietary databases, which are really good products, but they're also very expensive. Especially in the area of data warehousing, we can now lift Postgres, which already has a fantastic feature set, to a level of performance where it can suddenly compete. And this act of moving open source into spaces where previously only proprietary solutions could address the business problems, that's something I find very rewarding. It's a little bit like playing Rocky Balboa, you know, you're the small guy, and you're going in and fighting to win the title against some of those really heavyweight champions, and I find that quite rewarding. It's a big challenge, but that's kind of where the fun is as well.
Tobias Macey
0:05:59
And in terms of the bottlenecks that exist in open source Postgres, what are some of the common ones that users run into that prevent them from being able to use it for the full range of use cases that they might be intending to? And what would lead them to maybe move their analytical workloads into a data warehouse or a data lake versus continuing to run them in process with the Postgres engine and the Postgres database that they've been building up for their applications and their business?
Thomas Richter
0:06:28
So I think this is actually already very well framed, because Postgres in itself, I mean, as we all know, it's been around for 30 plus years, and it's a really mature and powerful product. However, I would say it has a blind spot in the area of parallelism and some of the things that hang together with it. And when I talk about parallelism here, I talk about the ability to deploy those modern multi-core systems and deploy many, many cores, like tens or hundreds, to a single problem. The kind of MPP style processing that those proprietary products already master, Postgres kind of got as a bit of an afterthought. So if you look at this feature called query parallelism, that was added in Postgres 9.6, and that's already approximately 20 years down Postgres history lane, right? So it's something that has been added very late in the development cycle of the database.
And whereas it's a great feature, and we love it being there, it is really not going as far as we personally believe it should, and that's why we are extending it. So query parallelism is one of the bottlenecks. Usually when you're finding it difficult to deploy a lot of your cores in your multi-core system to your Postgres queries, then Swarm64 can probably help you. Similarly, scanning large amounts of data that don't lend themselves to an index. Quite simply because indexes are great if you're trying to find a needle in the haystack. But what if you're not trying to find the needle in the haystack? What if you're trying to scan a range that is effectively a third of your table? Again, Postgres isn't very fast at scanning, so it will really hurt when you try to run those kinds of queries. Then another area is, of course, the concurrency of complex queries. So you have queries that fall into the first or the second category I've just been describing, and then you try to run multiple of them in parallel, and you will see how your individual Postgres workers are kind of scrambling for I/O and competing with each other. This is something that Swarm64 also addresses. So complex query concurrency is also a challenge we see in the field. And finally, and this is true for any database, we're just trying to, you know, help and contribute to it: certain query patterns are difficult to process, and there is always the question about, okay, should you use rewriting, or should you provide some additional intelligence, for example, to execute certain anti-joins smarter, and things like that? This is really a kind of never ending debate. Now, the default choice would usually be to rewrite the queries, but there is often a scenario where this is not desirable, or where this is just not an option, because the queries could, for example, have come from an application that the user can't touch. So query patterns that are difficult to process are kind of the fourth element. So in summary: query parallelism, scanning large amounts of data, many concurrent complex queries, and query patterns that are difficult to process. Those are kind of the four areas where we see a lot of challenges in Postgres when you try to scale it to a large degree. And when I say a large degree, I mean we're talking about at least 16 to 24 threads, like eight to 12 cores, right? We're not talking here about your little database running on 10% of a server, with a size of maybe one or two gigabytes. We're looking at larger problems, hundred gigabyte to terabyte range, something like that.
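For listeners who want to see what the stock query parallelism knobs look like, here is a minimal sketch in plain SQL of the built-in settings and how to check whether a plan actually runs in parallel. The orders table is made up for illustration, and nothing here is Swarm64-specific.

```sql
-- Built-in parallelism limits in stock PostgreSQL (no extension involved).
SHOW max_parallel_workers_per_gather;     -- defaults to 2 on recent versions
SET max_parallel_workers_per_gather = 8;  -- allow more workers per query for this session
SET max_parallel_workers = 16;            -- overall cap on parallel workers

-- Check whether the planner actually parallelizes a large aggregate scan
-- ("orders" is a hypothetical large table).
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day,
       count(*),
       avg(amount)
FROM orders
GROUP BY 1;
-- Look for Gather / Parallel Seq Scan nodes and the "Workers Launched" count.
```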
Tobias Macey
0:10:10
And in terms of Postgres, it's a common database for application workloads and for being able to do some lightweight analytics. But what are some of the common bottlenecks that users run into that prevent them from being able to use it for all their different use cases? And that might lead them to use a dedicated data warehouse or a data lake engine for being able to do more intensive analytic workloads?
Thomas Richter
0:10:33
Yeah, thanks. So it's a very good question, and as you already framed in your question, Postgres itself is extremely versatile, but it usually struggles as the quantity of data grows, and generally it tends to gravitate around four different areas. So the first one is related to query parallelism. Parallelism was kind of added to Postgres when it was already something like 20-odd years old. You're looking at the so-called query parallelism feature being added in version 9.6, and we're now at version 12, so this is only actually a few versions ago. And that means that when Postgres executes, it doesn't utilize modern multi-core systems quite to the degree you would in a data warehousing context, where you're usually working with this MPP, massively parallel processing, paradigm. So Postgres is kind of holding itself back a bit, and it's also missing some of the features to move data during the query process to actually keep the query parallel for a very long time. The second part is that scanning large amounts of data that don't lend themselves to an index is a challenge for Postgres. When you're implementing an index, you're usually solving the kind of problems where you're finding a needle in the haystack, or a few needles in the haystack. Whereas when you have a query that needs to scan effectively a third of your table, indexes don't help you much, and this is really where Postgres will then struggle. The other part is, what if you have many complex queries, and they could be of the first or the second kind I've just been describing, running concurrently? Many concurrent complex queries will really push the limits on your storage bottlenecks, the way you retrieve the data, the way data propagates through Postgres. So that's another area where we are seeing bottlenecks. And finally, my fourth point is that no database is perfect at being able to execute every query in a perfect way. However, there are certain query patterns from the domain of data warehousing that really require a little bit of special processing, a little bit of optimization, for example handling anti-joins differently and things like that. And this is really where there are query patterns that can turn Postgres queries into so-called never-come-back queries. So those are the four areas: query parallelism, the scanning of large amounts of data, many concurrent complex queries, and then certain query patterns that just turn a query into never-come-back. Those are the kinds of things we are seeing. And in general, this is especially relevant when you're moving into hundreds of gigabytes or terabytes or beyond; it tends to be a lot less relevant when you're at ten or 20 gigabytes of total database size.
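As a concrete example of the kind of query pattern that is difficult to process, here is the classic anti-join case in plain SQL, together with the usual hand-rewrite that is often the default remedy. The orders and refunds tables are hypothetical.

```sql
-- Anti-join pattern: orders that have no matching refund.
-- With a nullable subquery column, NOT IN can defeat the planner and degrade
-- into a very slow execution on large tables.
SELECT o.id
FROM orders o
WHERE o.id NOT IN (SELECT r.order_id FROM refunds r);

-- The usual rewrite: NOT EXISTS lets the planner choose a hash anti-join,
-- which scales far better when both inputs are large.
SELECT o.id
FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM refunds r WHERE r.order_id = o.id
);
```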
Tobias Macey
0:13:24
And in terms of the use cases that benefit from this parallelism, the obvious one is the data warehousing use case, where you're being asked to perform large aggregates on data sets, and maybe just for specific columns within a set of rows. But what are some of the other use cases that can benefit from the increased parallelism and some of the hardware optimizations that you're building into Swarm64?
Thomas Richter
0:13:47
Yeah, you've mentioned data warehousing. That is, of course, the obvious one, and in all honesty, data warehousing is a very, very catch-all expression, because there's a very wide variety in how you can design your underlying schema or what kinds of queries you're asking. So that's already a very, very broad field. However, there are also other areas that are quite relevant. For example, anything that has a kind of user-facing dashboarding or reporting element. In other words, you may have BI tools, you may have custom dashboards, you may have a Software as a Service solution that includes some customer interaction; let's take salesforce.com as an example, right? These kinds of applications allow your users to drill down, to aggregate, to find out what their current status is. So in other words, there's a lot of concurrent reporting and dashboarding happening, and these kinds of problems we're able to address very effectively. So it's coming back to the point I mentioned earlier: many, many concurrent complex queries. That's another use case where we see ourselves being very popular and a very good solution. And then another area is what we call new developments, in the area of geospatial data, for example, or machine learning. Just as an example, we did a project with a partner in Japan, and there it was around the subject of connected cars, analyzing geospatial data, and also looking for a certain predictable response time. And we were able to keep that response time window for a much, much longer time than standard Postgres without the Swarm64 acceleration. So if you then translate that back into cost, we actually found that we could get away with much less hardware, and as a result of that you would basically lower your costs by as much as 70%. So that's one area: geospatial data and time series data processing. But again, time series probably in context; we're not trying to be the next TimescaleDB. But we are allowing people to process time series if they have the need of it, in addition to, for example, that geospatial data or the reporting data, etc. And then, as I've mentioned, the other area is machine learning. It's very interesting when you need a certain kind of response speed, that kind of snappiness that Postgres generally has, and combine that with actually feeding in a lot of machine learning data, and at the same time pulling out a lot of data to feed your models. And this is something that we're doing with a company called Turbot; they're in the renewable energy space, optimizing wind turbines for energy generation and how they're actually positioned, and also looking at predictive maintenance cases. So these are just some areas. On the one side, the big data warehousing space, with many, many different use cases in that field, but also things related to dashboarding and reporting, anything in that field. And then of course any new developments: geospatial data with the immensely powerful PostGIS extension, and the machine learning space. Those are some of the areas where people find this very interesting.
Tobias Macey
0:17:19
And then because of the fact that they're able to get this improved performance out of their existing Postgres database, it removes the necessity to do a lot of the data copying and can simplify the overall system design. And I'm wondering, what are some of the other benefits that can be realized by keeping all of your data in the one database engine, but also what are some of the challenges that it poses, particularly in the areas of doing things like data modeling within the database, or, for the data warehousing use case, being able to generate some of the history tables so that you can capture changes over time, and things like that?
Thomas Richter
0:17:54
Yeah. So that's a very good question. Let me first frame the environment a little bit. This is very much thinking in the Postgres world, right? And what I mean to say there is, Postgres is a very schema based database. I mean, Postgres has pretty good document store capabilities, but everything you encounter, the way you work with the database, the schema is really at the heart of it. And things that are maybe schemaless are schemaless for a certain time, and you then use special operators to work with them. And that's a very conscious choice. But in general, if you're comfortable with a world of SQL, with a world where there are defined schemas, then this will be extremely versatile, and you will be able to process certain elements, like for example events in arrays, or certain schemaless elements and documents. That is all possible, but your base assumption should be around a schema based world, and that's something that's quite important. So if you're in an environment where you're willing and comfortable working with explicitly defined schemas, you will find this extremely versatile, and, as I've already mentioned, you'll be able to find solutions for all your different problems: for example, being able to time travel in your data, being able to audit what has happened in terms of changes, and so on. So I would say if you're thinking Postgres, if that's a mindset you like, you will get very far into all sorts of spaces: data warehousing, logging, geospatial data processing, time series data processing, machine learning, those kinds of things. You'll be able to expand into them while staying within your comfortable Postgres working paradigms. I think that is the key qualifier. So if you're happy with SQL, if you're happy with Postgres, this then extends very naturally.
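As a small sketch of the audit and history-table pattern alluded to here, one common way to capture changes over time in plain Postgres is a trigger that writes to a history table. The table, columns, and trigger names below are made up for the example.

```sql
-- Hypothetical base table plus a history table that records every change.
CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    balance numeric NOT NULL
);

CREATE TABLE accounts_history (
    id         bigint,
    balance    numeric,
    op         text        NOT NULL,                 -- 'INSERT' / 'UPDATE' / 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION accounts_audit() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO accounts_history (id, balance, op) VALUES (OLD.id, OLD.balance, TG_OP);
        RETURN OLD;
    ELSE
        INSERT INTO accounts_history (id, balance, op) VALUES (NEW.id, NEW.balance, TG_OP);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION requires PostgreSQL 11+; older versions use EXECUTE PROCEDURE.
CREATE TRIGGER accounts_audit_trg
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION accounts_audit();
```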
Tobias Macey
0:19:59
And increasing the processing throughput of the database can be beneficial for things that are compute intensive, like being able to parallelize the queries. But how does that shift the overall bottlenecks and impact the disk I/O in terms of the overall throughput of the database?
Thomas Richter
0:20:18
Yeah, I mean, the obvious thing that always happens when you're moving to MPP is you run into the I/O bottleneck you've mentioned, right? Suddenly, in many, many queries, the question becomes, how fast can you fetch the data? And one of the things that we did, and we'll probably touch on some of the other things when we talk a little more about architecture, was we created our own storage format. It is a hybrid row-column format, and it has some columnar indexing. The reason we went hybrid row-column is because Postgres itself has a row store, and as I always say, it's very difficult to teach a row store complete column store tricks, right? You'll possibly end up with the worst of both worlds. So we embraced part of the row store concept and built that kind of row-column hybrid format that allows you to still process queries while adhering to the Postgres logic. We are compressing those; Postgres has its own little data pages, and we're keeping a bit bigger data pages that are also compressed. This generally tends to work very, very well. And then there's some columnar indexing, as I've mentioned, to allow us to be a bit smart and not retrieve everything indiscriminately regardless of the query. So you can be a bit selective; some kind of skip reading, or certain range indexes, would probably be the closest comparison there. And all of that is kept in a format that can be processed by the CPU, but this format can equally be processed by our hardware extensions, like for example FPGAs. So we're looking here at two things, FPGAs and smart SSDs, that are capable of reading these formats and then doing a lot of processing along the way. And that usually helps you: it basically resolves the I/O bottleneck with compression, selective fetching, and then processing on additional hardware.
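For reference, the closest built-in Postgres analogue to the skip reading and range indexing described here is a BRIN (block range) index. A rough sketch with a made-up table follows; this is stock Postgres, not Swarm64's own columnar indexing.

```sql
-- BRIN stores min/max summaries per block range, so wide range scans can
-- skip whole chunks of a large, mostly-ordered table.
CREATE INDEX measurements_recorded_at_brin
    ON measurements            -- hypothetical append-mostly table
    USING brin (recorded_at);

-- A query that touches a large slice of the table can still prune block ranges.
EXPLAIN (ANALYZE)
SELECT device_id, avg(value)
FROM measurements
WHERE recorded_at >= now() - interval '120 days'
GROUP BY device_id;
```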
Tobias Macey
0:22:26
So digging further into the actual implementation of Swarm64, you mentioned that it's a plugin for Postgres, but can you talk through some more of the technical details of how you approached that, and some of the evolution that it's gone through since you first began working on the problem?
Thomas Richter
0:22:41
So what's pretty good is that we came into it having already built other database extensions, so we were really looking into, okay, what were the lessons we've learned, and we made the conscious choice to stay at the extension level with Postgres. In other words, we would not go in and build our own Postgres. As many of your listeners probably know, a lot of the popular projects and products in the market are actually Postgres derivatives. You have the example of Amazon Redshift, you have the example of IBM Netezza, or, for example, Pivotal Greenplum, which were all once upon a time versions of Postgres that were then taken private and forked into new projects. And we decided not to do that. So we started by looking at, okay, where are the extension hooks that we can use, where are certain APIs that we can use, and we started expanding from there. And Postgres is very, very versatile in that space. It's probably among the most extensible databases there are, including closed source databases; among both open and closed source databases, Postgres is probably among the most extensible. And what you can do is you can define certain ways in which your data is accessed, for example a custom scan provider, and you can define ways in which your data is stored. We started with foreign tables, the foreign data storage engines, because there was no native storage engine API yet at the point we started; that now exists in version 12. We are very eager to see how this table storage API will evolve in the future, and we may actually go much more in that direction. But for now, it's really a combination of defining certain table sources, in our case through that foreign table API, combined with certain access paths we can define, certain query planner hooks we can provide to Postgres, and certain cost functions. So it's really been designed very, very well in terms of extensibility, and you can just offer yourselves to all these different extension hooks, and then your respective functions will be called, and you have the ability to tell box-standard Postgres about all the great things you can do in addition. And this is how we worked. And we realized that it's not the easiest way of working, but it's in a way the most rewarding, because on the one side you're really benefiting from the lower effort and overhead to move between Postgres versions, and secondly, it was actually very easy for us to support other solutions. For example, our product also works with EnterpriseDB, and EnterpriseDB's Postgres Advanced Server is actually not open source. Still, we were able to compile for Postgres Advanced Server by EnterpriseDB and run on it. So now you can also use our product in a solution like EnterpriseDB, and that would not have been the case if we hadn't gone for the kind of modular, pluggable architecture that Postgres was offering us. Now that is how we work into the system. Let me just cover a few parts of what we're actually doing. On the one side, if we take the anatomy of a query, it goes into the system, and we are basically then offering Postgres, in addition to all the different data handling mechanisms it has itself, additional ways to process the query. So for example, we offer it the ability to move data around during the query, so-called shuffling, so the query can stay parallel for longer.
That's one of the things we do. We offer Postgres our own join implementation, specifically optimized for joining very large amounts of data. So if you want to join tables that have a few billion rows with tables that have a few million, or even a few billion rows themselves, that is something that can very quickly bring Postgres to its limits, and what we did is build a special join implementation for that. So that's something that is offered to Postgres, and it can pick it if it wants to. We offer certain query rewriting patterns. So if we notice that something is going to be executed very badly because, for example, it's a very linear execution mechanism when you could do it in parallel, then we will offer that to Postgres, and the Postgres query planner will then pick and choose. Once the query is planned and gets executed, we have the matching executor nodes for all these things, and we also have this accelerated I/O I was mentioning before. And when it comes to processing, we can sometimes offload the entire query to the hardware accelerators. So there are optional hardware accelerators: you can use FPGAs, you can use Samsung smart SSDs, and those FPGAs from Intel or Xilinx, or smart SSDs from Samsung, will then receive instructions and process data according to the query and only return the results. So all in all, there is a host of different functions we are offering to Postgres; the query planner will choose, like from a buffet. And if you have the optional additional hardware acceleration, it will also offload and push down a lot of the query processing directly to the additional hardware, making your system thereby even more efficient.
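To make the foreign table machinery concrete, here is what that extension surface looks like with the file_fdw wrapper that ships in Postgres contrib. This is only a generic illustration of the API family being described, not Swarm64's own DDL, which isn't shown in the episode; the server, table, and file names are made up.

```sql
-- A contrib foreign data wrapper; any FDW-based extension attaches through
-- the same CREATE SERVER / CREATE FOREIGN TABLE surface.
CREATE EXTENSION file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE events_csv (
    event_id   bigint,
    event_type text,
    created_at timestamptz
) SERVER csv_files
  OPTIONS (filename '/tmp/events.csv', format 'csv', header 'true');

-- From here on it is queried like any other table; the wrapper supplies scan
-- paths and cost estimates to the planner, which is the same hook surface an
-- accelerator extension can plug into.
SELECT event_type, count(*) FROM events_csv GROUP BY event_type;
```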
Tobias Macey
0:28:17
And then another element of this equation beyond Postgres is the available hardware. So you mentioned FPGAs and smart SSDs, and I know from looking through the documentation that you also have support for the Intel Optane persistent memory. And I'm curious how the overall landscape of hardware availability has evolved since you first began working on this, and some of the challenges that things like the cloud pose for people who are interested in being able to leverage the specialized hardware that you are able to take advantage of.
Thomas Richter
0:28:49
Yeah, that's a very good question, and that's also something where I'm happy that the market has really moved in, from our opinion, the right direction. Because when we started with this, as I mentioned earlier, we came from a very, very hardware driven world, and we were very early on using these FPGAs, first as a prototyping platform for processing data and for database processing. Then many changes happened in the market. On the one side, Intel has really been at the forefront of moving FPGA devices into the data center, and then Xilinx also followed. Then Amazon, already years ago, introduced an FPGA based instance into their cloud, and from there onwards it's really been step by step by step, and more and more clouds are enabling data center grade FPGA accelerator cards. In terms of cloud support in the context of Swarm64, it now becomes too many to mention everyone who's supporting FPGAs, but let me just mention the ones that we directly support. So of course you've got Amazon, you've got OVH, the large French data center, you've got Nimbix, a US based high performance data center, and it's public knowledge that Azure is coming out with an FPGA instance. So those are just some of the players in the market that we are focusing on at the moment, and there you can really get access to FPGAs in the cloud quite easily as an instance type. And it's ever easier to deploy them on premises; those can be obtained through OEMs. They're basically extension cards. They look like GPUs, more or less, just with a very, very different profile of what's inside. If you're just looking at the PCIe card, it looks more or less like a GPU, so nothing new and exciting outside of the box. But of course it gets quite exciting when you look inside.
Tobias Macey
0:30:51
And then another area of complexity is, because of the fact that you are acting as an extension to Postgres, you need to be able to support whatever different versions people are running in their installations. So while there might be a new feature that simplifies your work in version 12, as you mentioned, the table storage API, you still need to be backwards compatible to whatever Postgres is supporting in order to be able to take advantage of a wider range of adoption. And so I'm wondering how the overall evolution of Postgres has impacted the product direction and the engineering work necessary on your end to be able to build in all the features that you're trying to support, as well as the challenges that you're facing in terms of being able to support the range of versions that are available for people who are running Postgres.
Thomas Richter
0:31:39
So in general, sometimes people are directly plugging us into the existing database, but in general we're proposing a one-time backup and one-time restore, quite simply because when we are deploying to our clients, we usually give them a container based deployment. I know there may be some people that are religious about the teeny tiny bit of performance a containerized approach might cost, but just in terms of ease of deployment, it makes it so incredibly easy that in the predominant amount of cases we actually manage to convince the client to do it this way. And to be honest, 80 to 90% of the clients are already very, very happy with just going with a container. So when you're actually getting Swarm64, you will be getting a kind of matched set: it will be a box-standard Postgres, but of the right version that we recommend at that moment, combined with our extension and combined with all the relevant settings you need, in a container. And then if you're using an FPGA or Optane persistent memory on your system, it will also have all the right configuration parameters to make it really, really easy to deploy to that hardware. So you're getting almost cloud-like comfort there. And we're basically, by the way, doing the same with all our machine images for the different cloud instances I've been mentioning. So we really think it's much more convenient to do a one-time backup and restore, and then not fight with any configuration parameters or any details, than actually trying to retrofit into every single Postgres version that is out there. However, having said that, we will also make the deployment into more broadly available Postgres versions easier, so that maybe half a year down the line there will be a way you can extremely quickly just install the extension into something from probably Postgres 10 or 11 onwards to 12 or 13, so a fairly broad window of versions that we will just support out of the box. And just to pick up the detail in your question there about the Postgres storage engine: we're at the moment not utilizing the storage engine, because we are actually waiting to see how it will evolve. But you're right, once we've actually made that pivot from the foreign data wrapper to the storage engine, that will be forcing us to keep two versions maintained, so depending on which Postgres version you're on, you would basically use us in one way or the other. So that comment is true, but in general we've so far, knock on wood, been quite successful in keeping pace with Postgres.
Tobias Macey
0:34:12
And then the other element of compatibility is in the other extensions that people want to be able to use alongside your work at Swarm64. So I know you already mentioned PostGIS, which is one of the better known extensions in the ecosystem, but what are some of the other ones that people will commonly look to use alongside Swarm64? And what are some that you know to be conflicting, that won't work if they're using your extension?
Thomas Richter
0:34:38
Yeah, let me try to answer that question at a bit of a high level. So in general, people love using extensions, and something that's extremely popular is not only PostGIS, but also any kind of extended data types, like custom data types and so on, which is really one of the strongholds of Postgres. Those are of course important to support, and that's something we do, and it is very, very useful. So I would say custom data types and that kind of custom functionality around Postgres extensibility is really what we see most. Now, what does not work with Swarm64 in the current version? And there's a change coming, which I will just tease a little bit. In the current version, as I mentioned, we're keeping our own storage of the data, so anything that relies on how Postgres data is stored will conflict, or at least require a workaround. So what we generally advocate is for people to use a mix of what we call native tables, the ones that do not have this columnar storage and are standard Postgres tables, and some of those accelerated tables, and to mix and match them, horses for courses. Now, when you use a solution like, for example, a background backup tool that invisibly just copies pages, that is usually relying on some knowledge about how Postgres data looks, and hence it will run into trouble when trying to work with Swarm64 data. Similarly for replication schemes that are based on, for example, how the data is stored on disk; again, a similar issue. However, we have recognized that for customers it is sometimes actually quite useful to be able to just retain the data exactly as they store it. And so in an upcoming product version we will be looking more into what we call the complete drop-in, where people have more of a choice. They still have the ability to get the extreme acceleration for certain kinds of data; maybe these are the kind of append-only data we were talking about earlier, history tables and things like that, which would fit perfectly into our format. However, you may have other data that is perhaps replicated between multiple Postgres databases, etc., where you would choose a different storage format. And this is really where the upcoming product versions will go: they will allow you to keep your source format for the cases where it makes complete sense and still give you a good amount of acceleration, as well as use the bespoke analytical storage format for the cases where you want extreme performance.
Tobias Macey
0:37:08
and digging more into the FPGA capabilities. I know that most people are going to be familiar with the concepts of the CPU and the GPU as a coprocessor. And some of the relevant statistics of those different pieces of hardware for being able to select one that will fit well. But for people who aren't familiar with FPGAs, or who haven't worked with them closely, what are some of the benefits that an FPGA can provide, particularly in the database context? And what are some of the specifications that they should be looking at when they're selecting one for installing into their hardware or for deploying into a cloud environment?
Thomas Richter
0:37:46
So let me take a quick look at what an FPGA actually is. It's a configurable fabric. I often say it's like a blank sheet of paper that, when it wakes up, is told what it should be; in one case that could be a piece of sheet music for a symphony or something. And it's quite similar here: it's this blank sheet of paper being configured to be the processing logic you need. How this translates to the area of databases is that we turn it into a piece of processing logic that processes the individual data points in your data storage as they move through the chip. So as storage is moved through the FPGA, we've turned the entire fabric, the entire FPGA, into custom logic for our database processing. Now, some people ask us, do you compile every query into a specific configuration for that FPGA, so it does only that? No, we don't. We actually instead use the FPGA with a very SQL specific, but still quite versatile, processing unit that does all the processing: as in the compression, if you're looking at storing data, you compress and finalize with the FPGA, or if you're looking at reading data, you decompress and then execute the SQL query. And all of that happens while data is flowing through the FPGA; you have fantastically fine grained control over how your data moves. I would say this is probably the single biggest difference between CPUs and GPUs on the one side and FPGAs on the other: because you have that ability to reconfigure, you can make something very custom, and because you can make it custom, you can make sure data moves efficiently. I enjoy a little bit of GPU programming as a hobby, and the challenges you have in making sure that your processing happens effectively, the kind of knowledge you need about all your cache hierarchies and how data moves, that's something you do not need to consider in the FPGA context, because it's all determined by yourself. You are actually defining how data moves, and hence you can make it extremely effective and extremely efficient. So I would say that's one of the core elements. And then finally, another very interesting element is that this is reconfigurable within split seconds. So one of the processing units we have, for example, is a streaming text processor that is capable of finding wildcard based strings, like strings with fuzzy matching, inside your data as it moves through. Very effective, but as you can imagine, that takes a little bit of space. So the FPGA being reconfigurable, you could, depending on your workload, have those units included or excluded, and vice versa. If you have, for example, a nightly or weekly load window, you could reconfigure your FPGA, and that's basically something that our database does in the background: reconfigure the FPGA to be all writers, then it does the nightly load, and then it turns back into all readers to do the daily query processing. So those are examples of what makes the FPGA kind of unique versus CPUs and GPUs: on the one hand, the ability to really define how your data flows, and on the other hand, the ability to reconfigure, so you can actually shape shift your device to match your need.
Tobias Macey
0:41:14
And then for people who are adopting Swarm64, and particularly the hybrid row-column store, how does that impact the way that they should be thinking about data modeling and the table layout, and also, for people who are working with very large data sets, any partitioning or sharding considerations that they might have?
Thomas Richter
0:41:35
It stays pretty close to Postgres itself. In general, you can partition our data, but quite often it's not needed, because often partitioning is a requirement to overcome certain performance bottlenecks, and we don't necessarily require that. However, it is entirely possible to do it. So in terms of paradigms, there's nothing new to learn. It's really quite standard Postgres; it's just another storage format you can choose, where you can get additional benefit. Having said all that, this storage is really kind of an expert option; it is there to get the best possible performance. When people use our product, they will already get a benefit from all the other features, and the additional storage is kind of the icing on the cake. So what we generally recommend to our customers is start slowly and work your way into it. We don't advocate any big migrations or any big changes, particularly when you're coming from Postgres. There are usually a few small tweaks you can do, and you will see dramatic differences.
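Since partitioning comes up here, a quick sketch of standard declarative partitioning in Postgres 10 and later, with made-up table and column names, just to show that the paradigm is unchanged.

```sql
-- Range partitioning by timestamp; queries and DML target the parent table.
CREATE TABLE sensor_readings (
    reading_id  bigserial,
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE sensor_readings_2020_q1 PARTITION OF sensor_readings
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

CREATE TABLE sensor_readings_2020_q2 PARTITION OF sensor_readings
    FOR VALUES FROM ('2020-04-01') TO ('2020-07-01');
```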
Tobias Macey
0:42:35
In terms of your experience of building this product and growing the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned?
Thomas Richter
0:42:45
One of the things I found really, really interesting is to see how our customers, and also how our excellent solution engineering team, have actually solved some of these things. It shows you the boundaries of your product, and it being used in ways it wasn't really intended to be used, which was really fun to see. So let me give you one specific example. There was an issue around query processing speed, and that was actually already a year and a half ago, so it was a much earlier version of the product. But essentially the way the customer and one of our solution engineers got around the problem was that they actually turned everything into a Swarm64-based table format, like I've been describing earlier. It started from a transactional table, but it was so cheap to make that secondary copy, because everything was very, very fast, ingested and, you know, accelerated by the FPGA, and it was also very, very cheap to process once it was in that format, that the entire round trip was still significantly accelerating the query. That was really unexpected. If you think about it: okay, I have a table and I'm doing a very heavy operation on it. And then, no, wait a second, you actually take a copy of the table, run the operation on the copy of the table, and you're still faster than processing the original table in the first place. That was quite fascinating to see happen. So that was really a learning point. Now, in a way that was also a little quirky, so with our new versions we would now recommend a different design. But you know, at the end of the day, it's really, really fun seeing your product being used and seeing some of the performance benefits being put to quite unexpected uses.
Tobias Macey
0:44:33
And when is Swarm64 the wrong choice, where somebody might be better suited either just using vanilla Postgres or some of the other plugins in the ecosystem, or migrating to a different set of database technologies?
Thomas Richter
0:44:47
So generally, if your problem is small, or your system is small, I think we're probably not the right choice. For example, some people say, oh, I'm running a big database server, and what they really mean is they have four or eight physical cores and then eight or 16 threads. And this is really the kind of level where added parallelism becomes a little pointless, because there aren't that many cores to go around in the first place that you could parallelize over. Similarly, Postgres itself performs pretty well, even with these kinds of challenging industry benchmarks, if you're in areas like 10 gigabytes or 30 gigabytes of data. So I wouldn't say that for anything in that range Swarm64 would really be relevant, but it can sometimes already be relevant at 100 gigabytes of data. And then as you move up from there, into terabytes, into tens of terabytes, hundreds of terabytes, that's definitely a range where we feel very, very comfortable. So too small a system or too small a problem, or often the combination of the two, that's something that's definitely not so suited for us. And then the other part, as I've mentioned, is that we're not trying to introduce a new tool or invent any new paradigms. So you should be looking at Postgres, as in, you may be using Postgres already, or maybe considering Postgres. I think this is also a kind of qualifying criterion, so to say. If you really want to work on something that is, for example, NoSQL style, you shouldn't be looking at a Postgres extension, right? So that is, I think, another point. However, it doesn't mean you have to be on Postgres already; we find people who are looking at Postgres coming from those proprietary data warehouses we were talking about in the beginning, and for those we can actually be an excellent choice.
Tobias Macey
0:46:44
And then as you look to the near and medium term of the business and the technologies and the evolution of the Postgres ecosystem, what are some of the things that you have planned?
Thomas Richter
0:46:55
So in general, this notion of becoming more and more invisible, I would say that's kind of the overarching concept. So if I imagine where we are a year from now: you start with a server, you add some Optane DC, and you press install extension, and Swarm64 is using the Optane persistent RAM from Intel and will just be doing everything invisibly; you will get the acceleration. The same with an FPGA card: maybe on a cloud instance you choose, okay, I want to use a certain FPGA enabled cloud instance, or I'm installing an FPGA card into my server, or I'm buying a new server that already has an FPGA card, say from Intel or Xilinx, or a smart SSD, or an array of smart SSD drives from Samsung. And then all you do is install the extension; we detect the hardware and we adjust our parameters. That's really where I see ourselves going in the future. We've been able to show very, very good performance with the product we have now; it can be dramatically different, like 50x on some queries, and usually you see 10 to 20x depending on your workload. So a big, big acceleration, which is great, but we now want to make it easier and easier to use, and that's really where I see ourselves going. So you're adding hardware, you're ordering a server that has new hardware, and then using our extension you'll be able to use it very effectively, and it will all fall into place behind the scenes. We've got some pretty promising prototypes of that running in our lab, so I'm very confident we'll go that way, and we will become more and more invisible, apart from, of course, the massive performance differences that we want to make for our users.
Tobias Macey
0:48:47
Are there any other aspects of the work that you're doing at Swarm64, or the Postgres ecosystem, or some of the analytical use cases that we've highlighted, that we didn't discuss yet that you'd like to cover before we close out the show?
Thomas Richter
0:48:59
Well, one thing I want to mention is a big shout out to the community: we've managed to get our first patch through, and it has now been pushed, which was great. This is about making it easier to back up foreign tables as well in the Postgres environment. So that will go into one of the new upstream versions of Postgres; it should be there in version 13. So a big shout out to the community for that. And in general, we see ourselves as a member of that community, so we are looking all the time at, okay, what can be contributed. We're also looking very much into the initiatives of the community around this persistent RAM, Optane DC, and of course FPGAs and accelerators. So a big shout out to all the companies there in the Postgres ecosystem; it makes it a lot of fun to be there, because you've got so much support for this database.
Tobias Macey
0:49:50
So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll definitely have you add your contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Thomas Richter
0:50:07
Okay, I would say, actually, a really, really powerful open source visual BI tool that interacts with these different databases. I think that is something that could be quite transformative. So think about an open source Tableau, with that kind of power and capabilities. I don't want to discount any of the projects that are out there, but I think there's definitely room for one of those existing projects to grow into a really feature rich and easy to use visualizer that just connects to different database back ends and then just runs. So maybe I'm overlooking something obvious, but of all the tools we've been using at Swarm64, all the open source ones, we didn't find something that is quite as powerful as some of the proprietary offerings out there. So that may just be something that could be quite transformative in getting people to think more about data management and to utilize the database, in the context of those tools, to a greater degree than they are using it today.
Tobias Macey
0:51:17
Yeah, I can definitely agree with that, that there are a bunch of great point solutions or great systems that have a lot of features, but aren't necessarily very accessible to people who don't want to dig into a lot of custom development for them. So I'll second your point on that. So thank you very much for taking the time today to join me and discuss the work that you've been doing with Swarm64 and trying to optimize the capabilities of Postgres and simplify people's use cases there. It's definitely a very interesting project, so I thank you for the work you're doing, and I hope you enjoy the rest of your day.
Thomas Richter
0:51:49
Thank you very much. It's great to talk to you.
Tobias Macey
0:51:57
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!