Summary
The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
- Your host is Tobias Macey and today I’m interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Swarm64 is?
- How did the business get started and what keeps you motivated?
- What are some of the common bottlenecks that users of postgres run into?
- What are the use cases and workloads that gain the most benefit from increased parallelism in the database engine?
- By increasing the processing throughput of the database, how does that impact disk I/O and what are some options for avoiding bottlenecks in the persistence layer?
- Can you describe how Swarm64 is implemented?
- How has the product evolved since you first began working on it?
- How has the evolution of postgres impacted your product direction?
- What are some of the notable challenges that you have dealt with as a result of upstream changes in postgres?
- How has the hardware landscape evolved and how does that affect your prioritization of features and improvements?
- What are some of the other extensions in the postgres ecosystem that are most commonly used alongside Swarm64?
- Which extensions conflict with yours and how does that impact potential adoption?
- In addition to your work to optimize performance of the postgres engine, you also provide support for using an FPGA as a co-processor. What are the benefits that an FPGA provides over and above a CPU or GPU architecture?
- What are the available options for provisioning hardware in a datacenter or the cloud that has access to an FPGA?
- Most people are familiar with the relevant attributes for selecting a CPU or GPU, what are the specifications that they should be looking at when selecting an FPGA?
- For users who are adopting Swarm64, how does it impact the way they should be thinking of their data models?
- What is involved in migrating an existing database to use Swarm64?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the product and business of Swarm64?
- When is Swarm64 the wrong choice?
- What do you have planned for the future of Swarm64?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Swarm64
- Lufthansa Cargo
- IBM Cognos Analytics
- OLAP Cube
- PostgreSQL
- Geospatial Data
- TimescaleDB
- FPGA == Field Programmable Gate Array
- Greenplum
- Foreign Data Tables
- PostgreSQL Table Storage API
- EnterpriseDB
- Xilinx
- OVH Cloud
- Nimbix
- Azure
- Tableau
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
Your host is Tobias Macey. And today, I'm interviewing Thomas Richter about Swarm 64, a PostgreSQL extension to improve parallelism and add support for FPGAs.
[00:01:28] Unknown:
So, Thomas, can you start by introducing yourself? Yeah. Hi. My name is Thomas. I'm CEO and cofounder of Swarm 64. And I'm a strange beast because I live at the intersection of business, data management, and programming. So that's what I do and enjoy very much. And do you remember how you first got involved in the area of data management? So probably the first real exposure to enterprise grade data management and data wrangling was as an intern almost 20 years ago when I was working at Lufthansa Cargo, the cargo arm of the German national airline. In their cargo department, they did something that today you would probably call data science. Back then, they called it sales steering. And I basically pulled data out of a large IBM Cognos based data warehouse, with all the beauty of OLAP cubes and the like. So that was my first exposure to that space. And I've since always been at this kind of intersection point, as I mentioned. So I very much enjoy basing business decisions on a vision of the truth. And I think the most objective vision one can obtain is really looking at the data and the underlying effects.
And then you can make much smarter decisions, because you are looking to prove a hypothesis as opposed to just arguing opinions. So throughout my career I've always been at that kind of cross section. And when I had the opportunity to found something in the data space, I was very excited about it. And so we built Swarm 64.
[00:03:02] Unknown:
So can you describe a bit more about what Swarm 64 is and some of the work that you're doing there?
[00:03:08] Unknown:
So Swarm 64 is an extension for the hugely popular Postgres database. And I think to your listeners, Postgres will not be a new concept. Right? It's very widely adopted and immensely popular. And what we do is we extend it, and we are accelerating it for reporting, analytics, time series, geospatial workloads, and also for hybrid workloads that include transactional and analytical components. So that's what we do. And can you give us some of the backstory of how the business got started and what it is about it that keeps you motivated and keeps you continuing to invest your time and energy with it? Yeah. That's a very good question. And it's also quite an interesting journey that we took. So when we started this, and I mean that sounds horribly stereotypical, but this actually started in a Berlin co-working space. My cofounder and I met at a co-working space, and we started to go at it initially very much from the hardware angle. So my cofounder had developed some of the earliest mobile GPUs, and we were looking at data processing from a hardware angle. And as we evolved, we learned from interacting with our customers that everybody wants a full solution.
You don't want to have some kind of piece that you have to puzzle together. You really enjoy having a full solution. And for us, really, Postgres then came in naturally as a system we could accelerate. And not only with hardware, that was our original take, but also we've built a stronger and stronger software component to it. So as I will be explaining later, you now have the choice between software and hardware components as you wanna add them as options. And so, yeah, that's how we started. And I think the part that, I particularly enjoy about where we've come since we started this is that we're now in a situation where we can really challenge some market players.
And I'm talking about the big proprietary databases that are really good products, but they're also very expensive. And especially in the area of data warehousing, we can now lift Postgres, which has already a fantastic feature set, to a level of performance that it can suddenly compete. And this act of moving open source into spaces where previously only proprietary solutions could address the business problems, that's something I find very rewarding because it's a little bit like playing Rocky Balboa. You know, you're like the small guy and you're going in and you're playing there, and you're kind of fighting to win the title against some of those really heavyweight champions.
And I find that quite rewarding. It's a big challenge, but that's kind of where the fun is as well. And in terms of
[00:06:02] Unknown:
the bottlenecks that exist in the open source postgres, what are some of the common ones that users run into that prevent them from being able to use it for the full range of use cases that they might be intending to, and what would lead them to maybe move their analytical workloads into a data warehouse or a data lake versus continuing to run them in process with the Postgres engine and the Postgres database that they've been building up for their applications and their business?
[00:06:29] Unknown:
Yeah. So I think this is actually already very well framed, because Postgres in itself, as we all know, has been around for 30 plus years, and it's a really mature and powerful product. However, I would say it has a blind spot in the area of parallelism and some of the things that hang together with it. And when I talk about parallelism here, I talk about the ability to deploy those modern multi core systems and apply many, many cores, like tens or hundreds, to a single problem. The kind of MPP style processing that those proprietary products already master, Postgres kind of got as a bit of an afterthought.
So if you look at this feature called query parallelism that was added in Postgres 9.6, that's already approximately 20 years down Postgres history lane. Right? So it's something that has been added very late in the development cycle of the database. And whereas it's a great feature, and we love it being there, it is really not going as far as we personally believe it should, and that's why we are extending it. So query parallelism is one of the usual bottlenecks. When you are finding it difficult to deploy a lot of the cores in your multicore system to your Postgres queries, then Swarm 64 can probably help you.
Similarly, scanning large amounts of data, that are not lending themselves to an index, quite simply because indexes are great if you try to find the needle in the haystack. But what if you're not trying to find the needle in the haystack? What if you're trying to scan a range that is effectively a third of your table? Again, Postgres isn't very fast at scanning, so it will really hurt when you try to run these kind of queries. Then another area is, of course, the concurrency of complex queries. So you have queries that fall into the first or the second category I've just been describing, and then you try to run multiple of them in parallel, and you will see how your individual Postgres workers are kind of scrambling for IO and kind of competing with each other. This is something that Swarm also addresses. So complex query concurrency is something that, is also a challenge we see in the field. And finally, and this is true for any database, we're just trying to, you know, help and and contribute to it.
Certain query patterns are difficult to process, and there is always the question about, okay, should the user rewrite it, or should you provide some additional intelligence, for example, to execute certain anti-joins more intelligently and things like that? This is really a kind of never ending debate. Now the default choice would usually be to rewrite the queries, but there is often a scenario where this is not desirable, or where this is just not an option because the queries could, for example, come from an application that the user can't touch. So query patterns that are difficult to process are kind of the fourth element. So in summary, query parallelism, scanning large amounts of data, many concurrent complex queries, and query patterns that are difficult to process, those are the four areas where we see a lot of challenges in Postgres when you try to scale it to a large degree.
And when I say a large degree, I mean we're talking about at least 16 to 24 threads, something like 8 to 12 cores. Right? We're not talking here about your little system database running on 10% of the server, with a size of maybe 1 or 2 gigabytes. We're looking at larger problems, in the 100 gigabytes to terabyte range, something like that.
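For context, this is roughly what query parallelism in stock Postgres looks like from the SQL side; a minimal sketch with illustrative settings and a hypothetical `events` table, not taken from the interview:

```sql
-- Session-level knobs that cap how many workers one query may use
-- (max_worker_processes, the instance-wide ceiling, is set in postgresql.conf
-- and needs a restart).
SET max_parallel_workers_per_gather = 8;
SET max_parallel_workers = 16;

-- Inspect the plan of an aggregate over a hypothetical large fact table:
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day, count(*)
FROM events
WHERE created_at >= now() - interval '90 days'
GROUP BY 1;
-- "Gather" / "Parallel Seq Scan" nodes and the "Workers Launched" count show
-- how much of the work actually ran in parallel.
```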
[00:10:11] Unknown:
And in terms of Postgres, it's a common database for application workloads and for being able to do some lightweight analytics. But what are some of the common bottlenecks that users run into that prevent them from being able to use it for all their different use cases, and that might lead them to use a dedicated data warehouse or a data lake engine for being able to do more intensive analytic workloads?
[00:10:34] Unknown:
Yeah. Thanks. That's a very good question. As said, and as you already framed in your question, Postgres itself is extremely versatile. But it usually struggles as the quantity of data grows. And generally, the problems tend to gravitate around four different areas. So the first one is related to query parallelism. Parallelism was added to Postgres when it was already something like 20 odd years old. So you're looking at the so called query parallelism feature being added in version 9.6, and we're now at version 12, so this is actually only a few versions ago. And that means that when Postgres executes, it doesn't utilize modern multicore systems quite to the degree you would in a data warehousing context, where you're usually working with this MPP, massively parallel processing paradigm. So Postgres is kind of holding itself back a bit, and it's also missing some of the features to move data during the query process to actually keep the query parallel for a very long time. The second part is that scanning large amounts of data that do not lend themselves to an index is a challenge for Postgres.
When you're implementing an index, you're usually solving the kind of problems where you're finding a needle in the haystack or a few needles in the haystack. Whereas, when you have a query that requests to scan effectively a third of your table, indexes won't help you much. And this is really where, Postgres will then struggle. The other part is, what if you have many complex queries? And they could be of the first or the second kind I've just been describing concurrently. Like, many concurrent complex queries, they will really push the limits on your kind of storage bottlenecks and the way you retrieve the data, the way data propagates through Postgres. So that's another area where we are seeing bottlenecks.
And finally, my fourth point is that no database is perfect, as in being able to execute every query in a perfect way. However, there are certain query patterns from the domain of data warehousing that really require a little bit of special processing, a little bit of optimization, for example, handling anti-joins differently and things like that. And this is really where there are query patterns that can turn Postgres queries into so called never-come-back queries. So those are the four areas: query parallelism, the scanning of large amounts of data, many concurrent complex queries, and then certain query patterns that just turn a query into a never-come-back query. Those are the kind of things we are seeing. And in general, this is especially relevant when you're moving into hundreds of gigabytes or terabytes or beyond. It tends to be a lot less relevant when you're at 10 or 20 gigabytes of total database size.
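One of the "never come back" patterns mentioned here is the anti-join; as a hedged illustration (table names are made up, and it assumes the join column is declared NOT NULL, since NOT IN treats NULLs differently), a common rewrite looks like this:

```sql
-- Pattern that often plans poorly on large tables:
SELECT c.customer_id
FROM customers c
WHERE c.customer_id NOT IN (SELECT o.customer_id FROM orders o);

-- Equivalent anti-join formulation the planner usually handles much better:
SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
```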
[00:13:24] Unknown:
And in terms of the use cases that benefit from this parallelism, the obvious 1 is the data warehousing use case where you're being asked to perform large aggregates on datasets and maybe just for specific columns within a set of rows. But what are some of the other use cases that can benefit from the increased parallelism and some of the hardware optimizations that you're building into Swarm 64?
[00:13:48] Unknown:
Yeah. You've mentioned data warehousing. That is, of course, the obvious one. And in all honesty, data warehousing, I think, is a very catchall expression, because there's a very wide variety in how you can have your underlying schema design or what kind of queries you're asking. So that's already a very broad field. However, there are also other areas that are quite relevant. So for example, anything that allows a kind of user facing dashboarding or reporting element.
So, in other words, you may have BI tools. You may have custom dashboards. You may have a software as a service solution that includes some customer interaction. Let's take salesforce.com as an example. Right? These kinds of applications allow your users to drill down, to aggregate, to find out what their current status is. So in other words, there's a lot of concurrent reporting and dashboarding happening. And these kinds of problems we are able to address very effectively. So it's kind of coming back to the point I mentioned earlier, many, many concurrent complex queries. So that's another use case where we see ourselves being very popular and a very good solution. And then another area is actually what we call new developments, in the area of geospatial data, for example, or machine learning. So just as an example, we did a project with Toyota in Japan, and there it was around the subject of connected cars, analyzing geospatial data, and also looking for a certain predictable response time. And we were able to keep that response time window for a much, much longer time than standard Postgres without the Swarm64 acceleration.
So if you then translate that back into cost, we actually found that we could get away with much less hardware. And as a result of that, you would basically lower your costs by as much as 70%. So that's one area, geospatial data and time series data processing. But again, time series probably in context. We're not trying to be the next TimescaleDB, but we are allowing people to process time series if they have the need for it, in addition to, for example, the geospatial data or the reporting data, etcetera. And then, as I've mentioned, the other area is machine learning. It's very interesting when you need a certain response speed. So you take that kind of snappiness that Postgres generally has and combine that with feeding in a lot of machine learning data and at the same time pulling out a lot of data to feed your models.
And this is something that we're doing with a company called Turbot. They are in the renewable energy space, and they are optimizing wind turbines for energy generation and how they're actually positioned. So there we're in the area of optimizing wind turbines and also looking at predictive maintenance cases. So these are just some areas. On the one side, the big data warehousing space, with many, many different use cases in that field, but also things related to dashboarding and reporting, anything in that field, and then, of course, any new developments: geospatial data with the immensely powerful PostGIS extension, and the machine learning space. Those are some of the areas that people find us very interesting in. And then
[00:17:21] Unknown:
because of the fact that they're able to get this improved performance out of their existing postgres database, it removes the necessity to do a lot of the data copying and can simplify the overall system design. And I'm wondering what are some of the other benefits that can be realized by keeping all of your data in the 1 database engine. But what are some of the challenges that it poses as well, particularly in the areas of doing things like data modeling within the database or for the data warehousing use case, being able to generate some of the, history tables so that you can capture changes over time and things like that? Yeah. So that's a very good question. Let me first kind of frame the environment a little bit. So this is very much thinking in
[00:18:03] Unknown:
the Postgres world. Right? And what I mean to say there is that Postgres is a very schema based database. The way you work with the database, the schema is really at the heart of it. And things that are maybe schemaless are schemaless for a certain time, and you then use special operators to work with them, and that's kind of a very conscious choice. But in general, if you are comfortable with the world of SQL, with a world where there are defined schemas, then this will be extremely versatile. And you will be able to process certain elements, like, for example, events and arrays or certain schemaless elements and documents. That is all possible, but your base assumption should be around a schema based world. And so that's something that's quite important.
So if you're in an environment where you're willing and comfortable working with explicitly defined schemas, you will find this extremely versatile. And as I've already mentioned, you'll be able to find solutions for all your different problems, like, for example, being able to time travel in your data, being able to audit what has happened in terms of changes, and so on. Yeah. So I would say if you're thinking Postgres, if that's a mindset you like, you will get very far into all sorts of spaces: data warehousing, logging, geospatial data processing,
[00:19:38] Unknown:
time series data processing, machine learning. Those kind of things, you'll be able to,
[00:19:40] Unknown:
expand into while staying within your comfortable Postgres working paradigms. I think that is the key qualifier. So if you're happy with SQL, if you're happy with Postgres, this then extends very naturally.
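The "history table" and auditing idea mentioned above stays in plain Postgres terms; one common approach is an append-only audit table filled by a trigger. A minimal sketch, assuming Postgres 11 or later and a hypothetical `orders` table (all names here are illustrative, not from the interview):

```sql
CREATE TABLE orders_history (
    order_id   bigint      NOT NULL,
    status     text        NOT NULL,
    valid_from timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION orders_track_changes() RETURNS trigger AS $$
BEGIN
    -- Append the new row version so changes stay queryable over time.
    INSERT INTO orders_history (order_id, status)
    VALUES (NEW.order_id, NEW.status);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION orders_track_changes();
```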
[00:19:59] Unknown:
And increasing the processing throughput of the database can be beneficial for things that are compute intensive, like being able to parallelize the queries. But how does that shift the overall bottlenecks and impact the disk IO in terms of
[00:20:16] Unknown:
the overall throughput of the database? Yeah. I mean, the the obvious thing that always happens when you're, moving to MPP is you run into this IO bottleneck you've mentioned. Right? It's suddenly in many, many queries, it becomes the question, how fast can you fetch the data? And 1 of the things that we did, and we'll probably touch on some of the other things when we talk a little more about architecture, but some of the things we did was we created our own storage format, and that is a hybrid row column format. It has some columnar indexing. And the choice why we went hybrid row column is because Postgres itself is a row store. And as I always say, it's very difficult to teach a row store complete column store tricks. Right? Just like you'll you'll end up possibly with the worst of both worlds. So we kind of embraced part of the row store concept and build that kind of row column hybrid, format that allows you to still process queries in kind of adhering to the Postgres logic. We are compressing those. We are making our own kind of Postgres has its little data pages, and we are keeping a bit bigger data pages that are also compressed. So this generally tends to, work very, very well. And then there's some, columnar indexing, as I've mentioned, to also allow to be a bit smart and not retrieve everything and kind of indiscriminatory, of the query. So you can be a bit selective.
I would guess maybe some kind of skip reading or a certain, range indexes that will probably be the closest thing there. Yeah. And and all that is kept in a in a format, and this format can be processed by the CPU. But this format could equally be processed or can equally be processed by our hardware extensions, like, for example, FPGAs. So we're looking here at 2 things, FPGAs and Smart SSDs that are capable of reading these formats and then doing a lot of processing of those along the way. And that usually helps you it basically resolves the IO bottleneck with compression, special layout,
[00:22:22] Unknown:
selective fetching, and then processing and additional hardware. So digging further into the actual implementation of Swarm 64, you mentioned that it's a plugin for Postgres. But can you talk through sort of more of the technical details of how you approach that and some of the evolution that it's gone through since you first began working on the problem?
[00:22:42] Unknown:
So, what's pretty good is that we came into it having already built other database extensions. So we were really looking into, okay, what were the things, the lessons we've learned? And we made the conscious choice to stay on an extension level with Postgres. In other words, we would not go in and build our own Postgres. And as many of your listeners probably know, a lot of the popular projects and products in the market are actually Postgres derivatives. You have the examples of Amazon Redshift, IBM Netezza, or, for example, Pivotal Greenplum, which once upon a time were versions of Postgres that were then taken private and formed into new projects, and we decided not to do that. So we started with looking at, okay, where are extension hooks that we can use? Where are certain APIs that we can use? And we started expanding from there. And Postgres is very, very versatile in that space. I mean, it's probably among the most extensible databases there are, including closed source databases. So across both open and closed source databases, Postgres is probably among the ones that are most extensible. And what you can do is you can define certain ways in how your data is accessed.
For example, the custom scan provider. You can define ways in how your data is stored. We started with the foreign data tables, the foreign data storage engines, because there was no native storage engine API yet at the point we started. That now exists in version 12. We are very eager to see how this kind of table storage API will evolve in the future, and we may actually go much more in that direction. But for now, it's really a combination of defining certain table sources, in our case the foreign table API, combined with certain access paths that we can define, certain query planner hooks, and certain cost functions we can provide to Postgres. So it's really been designed very, very well in terms of extensibility, and you can just offer yourself to all these different extension hooks. Then your respective function will be called, and you have the ability to tell standard Postgres about all the great things you can do in addition.
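From the SQL side, the foreign-table route described here follows the standard foreign data wrapper workflow; a rough sketch with placeholder extension, server, and table names (not Swarm64's actual syntax):

```sql
CREATE EXTENSION some_accel_extension;

CREATE SERVER accel_server
    FOREIGN DATA WRAPPER some_accel_fdw;

-- The table's storage is handled by the extension's own format, but it is
-- queried with ordinary SQL alongside native Postgres tables.
CREATE FOREIGN TABLE sales_accel (
    sale_id bigint,
    sold_at timestamptz,
    amount  numeric
) SERVER accel_server;
```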
And this is how we worked, and we like that a lot. It's not the easiest way of working, but it's in a way the most rewarding, because on the one side, you're really benefiting from the lower effort and overhead to move between Postgres versions. And secondly, it was actually very easy for us to support other solutions. For example, our product also works for EnterpriseDB, and EnterpriseDB's Postgres Advanced Server is actually not open source. And still, we were able to compile for Postgres Advanced Server by EnterpriseDB, and we're able to run on that. So now you can also use our product in a solution like EnterpriseDB. And that would not have been the case if we hadn't gone for the kind of modular, pluggable architecture that Postgres was offering us. Now that is how we hook into the system. Let me just cover a few parts of what we're actually doing. So on the one side, if we take the anatomy of a query, it goes into the system, and in addition to all of Postgres's own data handling mechanisms, we're offering it additional ways to process the query. So for example, we offer it the ability to move data around during the query, so called shuffling, so the query can stay parallel for longer. That's one of the things we do. We offer Postgres our own join implementation specifically optimized for joining very large amounts of data. So if you want to join tables that have a few billion rows with tables that have a few million or even a few billion rows themselves, that is something that can very quickly bring Postgres to its limits. And what we did is build a special join implementation for that. So that's something that is offered to Postgres, and it can pick it if it wants to. We offer certain query rewriting patterns.
So if we can notice that something is going to be executed very badly, for example because it would run as a very linear execution mechanism when you could do it in parallel, then we will offer an alternative to Postgres, and the Postgres query planner will then pick and choose. Once the query is planned and gets executed, we have the matching executor nodes for all these things. And, also, we have this accelerated IO I was mentioning before. And when it comes to processing, we can sometimes actually offload the entire query to the hardware accelerators.
So there's optional hardware accelerators. You can use FPGAs. You can use Samsung Smart SSDs. And those FPGAs from Intel or Xilinx or Smart SSDs from Samsung, they will then receive instructions and process data according to the query and only return the results. And so all in all, there is a host of different functions we're offering to Postgres. The query planner will kind of choose, like, from a buffet. And if you have the optional additional hardware acceleration, it will also offload and push down a lot of the query processing directly to additional hardware and making your system thereby even more efficient. And then another element of this equation beyond Postgres is the available hardware. So you mentioned FPGAs and smart SSDs, and I know
[00:28:30] Unknown:
Intel Optane persistent memory. And I'm curious how the overall landscape of hardware availability
[00:28:37] Unknown:
has evolved since you first began working on this and some of the challenges that things like the cloud pose for people who are interested in being able to leverage the specialized hardware that you're able to take advantage of? Yeah. That's a very good question, and I'm happy that the market has really moved in, in our opinion, the right direction. Because when we started with this, I mentioned earlier that we came from a very, very hardware driven world. And we were very early on using these FPGAs as a prototyping platform, first for processing data and then for database processing. And then many changes happened in the market. On the one side, you suddenly had, and Intel has really been at the forefront of that, FPGA devices moving into the data center. Then Xilinx also followed, and then Amazon, already years ago, introduced an FPGA based instance into their cloud. And from there onwards, it's really been step by step by step, and more and more clouds are enabling data center grade FPGA accelerator cards. In terms of cloud support in the context of Swarm 64, there are actually now too many to mention everyone who's supporting FPGAs, but let me just mention the ones that we directly support. So, of course, you've got Amazon. You've got OVH.
It's a large French data center. You've got Nimbix as a US based high performance data center, and it's public knowledge that Azure is coming out with an FPGA instance. So those are just some in the market and and the players that we are focusing on at the moment. And there you can really get access to FPGAs, in the cloud quite easily in the instance type. So it makes it ever more easy to deploy them. On premise, those are those can be obtained through OEMs. They're basic extension cards. They look like GPUs more or less. Yeah. Just a very, very different profile of what's inside. But if you're just looking at the PCIe card, it looks more or less like a GPU. So nothing nothing new and exciting outside of the box. But, of course, it gets quite exciting when you look inside. And then another area of complexity
[00:30:54] Unknown:
is because of the fact that you are acting as an extension to postgres, you need to be able to support whatever different versions people are running in their installations. So while there might be a new feature that simplifies your work in version 12, as you mentioned, the table storage API, you still need to be able to be backwards compatible to whatever Postgres is supporting in order to be able to take advantage of a wider range of adoption. And so I'm wondering how the overall evolution of Postgres has impacted the product direction and the engineering work necessary on your end to be able to build in all the features that you're trying to support as well as the challenges that you're facing in terms of being able to support the range of versions that are available for people who are running Postgres?
[00:31:40] Unknown:
So in general, sometimes people are directly plugging us into the existing database, but in general we are proposing a one-time backup and one-time restore. Quite simply because when we are deploying to our clients, we usually give them a container based deployment. I know there may be some people that are religious about the teeny tiny bit of performance a containerized approach might cost, but just in terms of ease of deployment, it makes it so incredibly easy that in the predominant share of cases we actually manage to convince the client to do it this way. And to be honest, 80 to 90% of the clients are already very, very happy with just going with the container. So when you're actually getting Swarm, you will be getting a kind of matched set. It will be a standard Postgres, but of the right version that we recommend at that moment, combined with our extension and all the relevant settings you need, in a container. And then if you're using an FPGA or Optane Persistent Memory on your system, it will also have all the right configuration parameters to make it really, really easy to deploy to this hardware. So you're getting almost cloud like comfort there. And we're, by the way, doing the same with all our machine images for the different cloud instances I've been mentioning. So we really think it's much more convenient to do a one-time backup and restore and then not fight with any configuration parameters or any details, than actually trying to retrofit into every single Postgres version there is out there. However, having said that, we will also make the deployment into more broadly available Postgres versions easier. So half a year down the line there may be a way you can extremely quickly just install the extension into something, probably from Postgres 10 or 11 onwards to 12 or 13. Yeah. So a fairly broad window of versions that we will just support out of the box. And just to pick up the detail in your question there about the Postgres storage engine, we're at the moment not utilizing the storage engine, because we are actually waiting to see how this will evolve. But you're right. Once we've actually made that pivot from the foreign data tables to the storage engine, that will be forcing us to keep two versions maintained. So depending on which Postgres version you're on, you would basically be supported in one way or the other. So that comment is true. But in general, we've so far, knock on wood, been quite successful in keeping pace with Postgres. And then the other element of compatibility
[00:34:15] Unknown:
is in the other extensions that people want to be able to use alongside your work at Swarm 64. So I know you already mentioned PostGIS, which is one of the better known extensions in the ecosystem. But what are some of the other ones that people will commonly look to use alongside Swarm 64,
[00:34:32] Unknown:
and what are some that you know to be conflicting, that won't work if they're using your extension? Yeah. Let me try to answer that question a little more on a high level. So in general, people love using extensions, and something else that's extremely popular is not only PostGIS but also any kind of extended data types. So custom data types and so on, which are really one of the strongholds of Postgres, are of course important to support, and that's something we do. And that is very, very useful. So I would say custom data types and that kind of custom functionality around Postgres extensibility is really what we see most. Now what does not work with Swarm? In the current version, and there's a change coming, which I will just tease a little bit, but in the current version, as I mentioned, we're keeping our own storage of the data. So anything that relies on how Postgres data is stored will not conflict outright, but it will require a workaround. In other words, what we generally advocate is that people use a mix of what we call native tables, the ones that do not have this columnar storage, so standard Postgres tables, and also some of those accelerated tables, but generally mix and match them, horses for courses. Now when you then use a solution like, for example, a background backup tool that kind of invisibly just copies pages, that is usually relying on some knowledge about how Postgres data looks, and hence it will run into trouble when trying to work with Swarm data.
Similarly, replication schemes that are based on, for example, how the data is stored on disk, again, similar issue. However, we've recognized that for customers it is sometimes actually quite useful to be able to just retain the data exactly as they store it. And so in an upcoming product version, we will be looking more into what we call the complete drop in, where people have more of a choice. They still have the ability to get the extreme acceleration for certain amounts of data. Maybe these are the kind of append only data we were talking about earlier, history tables and things like that. They would fit perfectly into our format. However, you may have other data that is perhaps replicated between multiple Postgres databases, etcetera, where you would choose a different storage format.
And this is really where the upcoming product versions will go. They will allow you to keep your storage format for the cases where it makes complete sense and still give you a higher amount of acceleration, as well as use the bespoke analytical storage format for the cases where you want extreme performance.
[00:37:08] Unknown:
And digging more into the FPGA capabilities, I know that most people are going to be familiar with the concepts of the CPU and the GPU as a coprocessor and some of the relevant statistics of those different pieces of hardware for being able to select 1 that will fit well. But for people who aren't familiar with FPGAs or who haven't worked with them closely, what are some of the benefits that an FPGA can provide, particularly in the database context? And what are some of the specifications that they should be looking at when they're selecting 1 for installing into their hardware or for deploying into a cloud environment? Yeah. So let me take a quick look at what an FPGA actually is. So it's actually a configurable fabric.
[00:37:53] Unknown:
I often say it's like a blank sheet of paper that, when it wakes up, is told what it should be. And it could, for example, be a piece of sheet music for a symphony or something, and it's quite similar: this blank sheet of paper being configured to be the processing logic you need. And how this translates to the area of databases is that we turn it into a piece of processing logic that processes the individual data points in your data storage as they move through the chip. So as storage is moved through the FPGA, we've turned the entire fabric, the entire FPGA, into custom logic for database processing. Now some people ask us, do you compile every query into a specific configuration for that FPGA, so it does only that? No. We don't. We instead use the FPGA with a very SQL specific, but still quite versatile, processing unit that does all the processing, including the compression.
If you're looking at storing data, you compress and finalize with the FPGA, or if you're looking at reading data, you decompress and then execute the SQL query. And all that happens while data is flowing through the FPGA. You have fantastically fine grained control over how your data moves. And I would say this is probably the single biggest difference between CPUs and GPUs on the one side and FPGAs on the other: because you have that ability to reconfigure, you can make something very custom. And because you can make it custom, you can make sure data moves efficiently. I enjoy a little bit of GPU programming as a hobby, and the challenges you have in making sure that your processing happens effectively, you know, the kind of knowledge you need about all your cache hierarchies and how data moves, that's something you do not need to consider in the FPGA context, because it's all determined by yourself. You're actually defining how data moves, and hence you can make it extremely effective and extremely efficient. So I would say that's one of the core elements. And then finally, another very interesting element is that this is reconfigurable within split seconds. So one of the processing units we have, for example, is a streaming text processor that is capable of finding wildcard based strings, so strings with fuzzy matching, inside your data as it moves through. Very effective. But as you can imagine, that takes a little bit of space. So with the FPGA being reconfigurable, you could effectively, depending on your workload, have those units included or excluded.
If you have, for example, a nightly or weekly load window, you could reconfigure your FPGA. And that's basically something that our database does in the background. Reconfigure the FPGA to be all writers, then it does the nightly load, and then it turns back into all readers to do the daily query processing. So those are examples of what the FPGA is kind of unique versus CPUs and GPUs. It's the ability to really define how does my data flow, and on the other hand, the ability to reconfigure so you can actually shape shift your device to match your need.
[00:41:15] Unknown:
And then for people who are adopting Swarm 64 and particularly the hybrid row column store, how does that impact the way that they should be thinking about data modeling and the table layout? And, also, for people who are working with very large datasets, any partitioning
[00:41:33] Unknown:
or sharding considerations that they might have? Pretty close to Postgres itself. In general, you can partition our data, but quite often it's not needed, because partitioning is often a requirement to overcome certain performance bottlenecks, and we don't necessarily require that. However, it is entirely possible to do it. So, in terms of paradigms, there's nothing new to learn. It's really quite standard Postgres. It's just another storage format you can choose where you can get additional benefit. Having said all that, this storage is really kind of an expert option; it is there to get the best possible performance. When people use our product, they will already get a benefit from all the other features, and the additional storage is kind of the icing on the cake. So what we generally recommend to our customers is to start slowly and work your way into it. We don't advocate any big migrations or big changes, particularly when you're coming from Postgres. There's usually a few small tweaks you can do, and you will see dramatic differences.
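For the partitioning point, the standard Postgres tools apply unchanged; a small sketch of declarative range partitioning plus a BRIN index for append-mostly data (table and column names are illustrative, not from the interview):

```sql
CREATE TABLE measurements (
    device_id   bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE measurements_2020_q1 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

-- BRIN is a tiny block-range index, close to the "skip reading" idea
-- mentioned earlier, and works well when data arrives in time order.
CREATE INDEX ON measurements_2020_q1 USING brin (recorded_at);
```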
[00:42:36] Unknown:
In terms of your experience of building this product and growing the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned? One of the things I found really, really interesting is to see how our customers and also how our excellent
[00:42:52] Unknown:
solution engineering team has actually solved some of these things. It shows you the boundaries of your product, and it being used in ways it wasn't really intended to be used, which was really fun to see. So let me just give you one specific example. There was an issue around query processing speed, and that was actually already a year and a half ago, so it was a much earlier version of the product. But, essentially, the way the customer and one of our solution engineers got around the problem was that they actually turned everything into a Swarm 64 based table format like I've been describing earlier. And this was really a transactional table it started from, but it was so cheap to make that secondary copy because everything was ingested very, very fast and accelerated by the FPGA.
And then it was also very, very cheap to process once it was in that format, so that the entire round trip was still significantly accelerating the query. So that was really unexpected. If you think about it: okay, I have a table and I'm doing a very heavy operation on it. And then, no, wait a second, you actually don't. You take a copy of the table and basically run the operation on the copy of the table, and you're still faster than processing the original table in the first place. That was quite fascinating to actually see happen. So that was really a kind of learning point. Now in a way, that was also a little quirky. So with our newer versions, we would now recommend a different design.
But, you know, at the end of the day, it's really, really fun seeing your product being used, and seeing some of the performance benefits being put to quite unexpected uses.
[00:44:34] Unknown:
And when is Swarm 64 the wrong choice and somebody might be better suited either just using Vanilla Postgres or some of the other plugins in the ecosystem or migrating to a different set of database technologies?
[00:44:48] Unknown:
So generally, if your problem is small or your system is small, I think we're probably not the right choice. For example, some people say, oh, I'm running a big database server, and what they really mean is they have 4 or 8 physical cores and 8 or 16 threads. And this is really the kind of level where added parallelism becomes a little pointless, because, you know, there aren't that many cores to go around in the first place. So what do you want to parallelize? Similarly, Postgres itself performs pretty well, even with these kinds of challenging industry benchmarks, if you're in areas like 10 gigabytes or 30 gigabytes of data.
So I wouldn't say Swarm would really be relevant for anything in that range. But it can sometimes already be relevant for 100 gigabytes of data. And then as you move up from there, into terabytes, into tens of terabytes, hundreds of terabytes, that's definitely a range where we feel very, very comfortable. So too small a system or too small a problem, or often the combination of the two, that's something that's definitely not so suited for us. And then the other part, as I've mentioned, is that we're not trying to introduce a new tool or invent any new paradigms.
So you should be looking at Postgres as in maybe using Postgres already, maybe considering Postgres. I think this is also a kind of qualifying criterion, so to say. If you really wanna work on something that is, for example, NoSQL style, you know, you shouldn't be looking at a Postgres extension. Right? So that is, I think, another point. However, it doesn't mean you have to be in Postgres already. We find people who are looking at Postgres
[00:46:38] Unknown:
coming from those proprietary data warehouses we've been talking about in the beginning. And for those, we can actually be an excellent choice. And then as you look to the near and medium term of the business and the technologies and the evolution of the Postgres ecosystem, what are some of the things that you have planned?
[00:46:55] Unknown:
So in general, this notion of becoming more and more invisible, I would say that's kind of the overarching concept. So if I imagine where we are a year from now: you start with a server, you add some Optane DC persistent memory, and you just install the extension. There's Swarm 64 using the Optane persistent RAM from Intel, and it will just be doing everything. Invisibly, you will get the acceleration. The same with an FPGA card, maybe on a cloud instance: you choose, okay, I want to use a certain FPGA enabled cloud instance, or I'm installing an FPGA card into my server, or I'm buying a new server that already has an FPGA card, say from Intel or Xilinx, or a smart SSD or an array of smart SSD drives from Samsung.
And then all you do is install the extension. We detect the hardware. We adjust our patterns. That's really where I see ourselves going in the future. So we've been able to show very, very good performance with the product we have now. It can be dramatically different, like 50x on some queries. Usually, you see 10 to 20x depending on your workload. So big, big acceleration. That's great, but we now want to make it easier and easier to use, and that's really where I see ourselves going. So you're adding hardware, or you're ordering a server that has new hardware, and then using our extension, you'll be able to use it very effectively, and it will all fall into place behind the scenes. And we've got some pretty promising prototypes of that running in our lab.
And so I'm very confident we'll go that way, and we'll become more and more invisible apart from, of course, the massive performance differences that we wanna make for our users.
[00:48:48] Unknown:
Are there any other aspects of the work that you're doing at Swarm 64 or the Postgres ecosystem or some of the analytical use cases that we've highlighted that we didn't discuss that you'd like to cover before we close out the show? Well, one thing I want to mention is a big shout out to the community.
[00:49:05] Unknown:
We've managed to get our first patch through. This has now been pushed, which was great. This is about making it easier to also back up foreign tables in the Postgres environment. So that will go into one of the new upstream versions of Postgres; it should be there in version 13. So a big shout out to the community for that. And in general, we see ourselves as a member of that community. So we're looking all the time at, okay, what can be contributed. We're also looking very much into the initiatives of the community around this persistent RAM, Optane DC, and, of course, FPGAs and accelerators. So a big shout out to all the companies in the Postgres ecosystem that make it a lot of fun to be there, because you've got so much support for this database.
[00:49:50] Unknown:
So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll definitely have you add your contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:07] Unknown:
Okay. I would say, actually, that a really, really powerful open source visual BI tool that interacts with these different databases, I think that is something that could be quite transformative. So think about an open source Tableau, with that kind of power and capabilities. I don't want to discount any of the projects that are out there, but I think there's definitely room for one of those existing projects to grow into a really feature rich and easy to use visualizer that just connects to different database back ends and then just runs. I may be overlooking something obvious, but from all the tools we've been using at Swarm, all the open source ones, we didn't find something that is quite as powerful
[00:51:01] Unknown:
as some of the proprietary offerings out there. So that is maybe something that could be quite transformative in getting people to think more about data management and to utilize their database, in the context of those tools, to a greater extent than they are today. Yeah. I can definitely agree with that, that there are a bunch of great point solutions or great systems that have a lot of features, but aren't necessarily very accessible to people who don't want to dig into a lot of custom development. So I'll second your point on that. So thank you very much for taking the time today to join me and discuss the work that you've been doing with Swarm 64 and trying to optimize the capabilities of Postgres and simplify people's use cases there. It's definitely a very interesting project. So I thank you for the work you're doing, and I hope you enjoy the rest of your day. Thank you very much. It was great to talk to you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Thomas Richter and Swarm 64
What is Swarm 64?
Bottlenecks in Open Source Postgres
Use Cases Benefiting from Parallelism
Benefits and Challenges of Keeping Data in One Database Engine
Implementation and Evolution of Swarm 64
Hardware Landscape and Cloud Challenges
Compatibility with Postgres Versions and Extensions
FPGA Capabilities and Benefits
Data Modeling and Table Layout Considerations
Lessons Learned and Unexpected Uses
When Swarm 64 is the Wrong Choice
Future Plans and Evolution of Swarm 64
Community Contributions and Final Thoughts