Power Up Your PostgreSQL Analytics With Swarm64 - Episode 133

Summary

The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!

Tidy Data is a monitoring platform for your data pipeline. Custom in-house solutions are costly, laborious, and fragile. Replacing them with Tidy Data’s consistent managed data ops platform will solve these issues. Monitor your data pipeline like you monitor your website. It’s like Pingdom for data. No credit card required to sign up. Go to dataengineeringpodcast.com/tidydata today to get started with their free tier.


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
  • Your host is Tobias Macey and today I’m interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Swarm64 is?
    • How did the business get started and what keeps you motivated?
  • What are some of the common bottlenecks that users of postgres run into?
  • What are the use cases and workloads that gain the most benefit from increased parallelism in the database engine?
  • By increasing the processing throughput of the database, how does that impact disk I/O and what are some options for avoiding bottlenecks in the persistence layer?
  • Can you describe how Swarm64 is implemented?
    • How has the product evolved since you first began working on it?
  • How has the evolution of postgres impacted your product direction?
    • What are some of the notable challenges that you have dealt with as a result of upstream changes in postgres?
  • How has the hardware landscape evolved and how does that affect your prioritization of features and improvements?
  • What are some of the other extensions in the postgres ecosystem that are most commonly used alongside Swarm64?
    • Which extensions conflict with yours and how does that impact potential adoption?
  • In addition to your work to optimize performance of the postgres engine, you also provide support for using an FPGA as a co-processor. What are the benefits that an FPGA provides over and above a CPU or GPU architecture?
    • What are the available options for provisioning hardware in a datacenter or the cloud that has access to an FPGA?
    • Most people are familiar with the relevant attributes for selecting a CPU or GPU, what are the specifications that they should be looking at when selecting an FPGA?
  • For users who are adopting Swarm64, how does it impact the way they should be thinking of their data models?
  • What is involved in migrating an existing database to use Swarm64?
  • What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the product and business of Swarm64?
  • When is Swarm64 the wrong choice?
  • What do you have planned for the future of Swarm64?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:13
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required. Your host is Tobias Macey, and today I'm interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs. So Thomas, can you start by introducing yourself?
Thomas Richter
0:01:30
Yeah. Hi, my name is Thomas. I'm CEO and co-founder of Swarm64, and I'm a strange beast because I live at the intersection of business, data management, and programming. So that's what I do and enjoy very much.
Tobias Macey
0:01:46
And do you remember how you first got involved in the area of data management?
Thomas Richter
0:01:49
So probably the first real exposure to enterprise grade data management and data wrangling was as an intern almost 20 years ago, when I was working at Lufthansa Cargo, the cargo arm of the German national airline. In their cargo department, they did something that today you would probably call data science; back then they called it sales steering. And I basically pulled data out of a large IBM Cognos based data warehouse, with all the beauty of OLAP cubes and the like. So that was my first exposure to that space, and I've since always been at this kind of intersection point, as I mentioned. I very much enjoy basing business decisions on a version of the truth, and I think the most objective version one can obtain is really by looking at the data and, basically, the underlying facts. Then you can make much smarter decisions, because you're looking to prove a hypothesis as opposed to just arguing opinions. So throughout my career I've always been at that kind of cross section, and when I had the opportunity to found something in the data space, I was very excited about it. And so we basically built Swarm64.
Tobias Macey
0:03:01
So can you describe a bit more about what Swarm64 is and some of the work that you're doing there?
Thomas Richter
0:03:08
So Swarm64 is an extension for the hugely popular Postgres database. And I think to your listeners, Postgres will not be a new concept, right? It's very widely adopted and hugely popular. And what we do is we extend into it, and we are basically accelerating it for reporting, analytics, time series, and spatial workloads, and also for hybrid workloads that include transactional components. So that's what we do.
Tobias Macey
0:03:38
And can you give us some of the backstory of how the business got started and what it is about it that keeps you motivated and keeps you continuing to invest your time and energy in it?
Thomas Richter
0:03:49
Yeah, that's a very good question, and it's also quite an interesting journey that we took. So when we started this, and I mean, that sounds horribly stereotypical, but this was actually started in a Berlin coworking space. My co-founder and I met at a co-working space, and we started to go at it initially very much from the hardware angle. My co-founder had developed some of the earliest mobile GPUs, and we were basically looking at data processing from a hardware angle. And as we evolved, we learned from interacting with our customers that everybody wants a full solution: you don't want some kind of piece that you have to puzzle together, you really enjoy having a full solution. And for us, Postgres then came in naturally as a system we could accelerate, and not only with hardware, which was our original take, but also with a stronger and stronger software component that we've built on top. So as I will be explaining later, you now have the choice between software and hardware components that you can add as options. And so yeah, that's how we started. And I think the part that I particularly enjoy about where we've come since we started this is that we're now in a situation where we can really challenge some market players. I'm talking about the big proprietary databases, which are really good products, but they're also very expensive. Especially in the area of data warehousing, we can now lift Postgres, which already has a fantastic feature set, to a level of performance where it can suddenly compete. And this act of moving open source into spaces where previously only proprietary solutions could address the business problems, that's something I find very rewarding. It's a little bit like playing Rocky Balboa, you know, you're the small guy, and you're going in and fighting to win the title against some of those really heavyweight champions, and I find that quite rewarding. It's a big challenge, but that's kind of where the fun is as well.
Tobias Macey
0:05:59
And in terms of the bottlenecks that exist in open source Postgres, what are some of the common ones that users run into that prevent them from being able to use it for the full range of use cases that they might be intending to? And what would lead them to maybe move their analytical workloads into a data warehouse or a data lake versus continuing to run them in process with the Postgres engine and the Postgres database that they've been building up for their applications and their business?
Thomas Richter
0:06:28
So I think this is actually already very well framed, because Postgres in itself, I mean, as we all know, it's been around for 30 plus years, and it's a really mature and powerful product. However, I would say it has a blind spot in the area of parallelism and some of the things that hang together with it. And when I talk about parallelism here, I talk about the ability to deploy those modern multi-core systems and deploy many, many cores, like tens or hundreds, to a single problem. The kind of MPP style processing that those proprietary products already master, Postgres kind of got as a bit of an afterthought. So if you look at this feature called query parallelism, that was added in Postgres 9.6, and that's already approximately 20 years down Postgres history lane, right? So it's something that has been added very late in the development cycle of the database.
And whereas it's a great feature, and we love it being there, it is really not going as far as we personally believe it should, and that's why we are extending it. So query parallelism is one of the bottlenecks. Usually when you're finding it difficult to deploy a lot of your cores in your multi-core system to your Postgres queries, then Swarm64 can probably help you. Similarly, scanning large amounts of data that don't lend themselves to an index. Quite simply because indexes are great if you're trying to find a needle in the haystack. But what if you're not trying to find the needle in the haystack? What if you're trying to scan a range that is effectively a third of your table? Again, Postgres isn't very fast at scanning, so it will really hurt when you try to run those kinds of queries. Then another area is, of course, the concurrency of complex queries. So you have queries that fall into the first or the second category I've just been describing, and then you try to run multiple of them in parallel, and you will see how your individual Postgres workers are kind of scrambling for I/O and competing with each other. This is something that Swarm64 also addresses. So complex query concurrency is also a challenge we see in the field. And finally, and this is true for any database, we're just trying to, you know, help and contribute to it: certain query patterns are difficult to process, and there is always the question about, okay, should you use rewriting, or should you provide some additional intelligence, for example, to execute certain anti-joins smarter, and things like that? This is really a kind of never ending debate. Now, the default choice would usually be to rewrite the queries, but there is often a scenario where this is not desirable, or where this is just not an option, because the queries could, for example, have come from an application that the user can't touch. So query patterns that are difficult to process are kind of the fourth element. So in summary: query parallelism, scanning large amounts of data, many concurrent complex queries, and query patterns that are difficult to process. Those are kind of the four areas where we see a lot of challenges in Postgres when you try to scale it to a large degree. And when I say a large degree, I mean we're talking about at least 16 to 24 threads, like eight to 12 cores, right? We're not talking here about your little database running on 10% of a server, with a size of maybe one or two gigabytes. We're looking at larger problems, hundred gigabyte to terabyte range, something like that.
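For listeners who want to see what the stock query parallelism knobs look like, here is a minimal sketch in plain SQL of the built-in settings and how to check whether a plan actually runs in parallel. The orders table is made up for illustration, and nothing here is Swarm64-specific.

```sql
-- Built-in parallelism limits in stock PostgreSQL (no extension involved).
SHOW max_parallel_workers_per_gather;     -- defaults to 2 on recent versions
SET max_parallel_workers_per_gather = 8;  -- allow more workers per query for this session
SET max_parallel_workers = 16;            -- overall cap on parallel workers

-- Check whether the planner actually parallelizes a large aggregate scan
-- ("orders" is a hypothetical large table).
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day,
       count(*),
       avg(amount)
FROM orders
GROUP BY 1;
-- Look for Gather / Parallel Seq Scan nodes and the "Workers Launched" count.
```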
Tobias Macey
0:10:10
And in terms of Postgres, it's a common database for application workloads and for being able to do some lightweight analytics. But what are some of the common bottlenecks that users run into that prevent them from being able to use it for all their different use cases? And that might lead them to use a dedicated data warehouse or a data lake engine for being able to do more intensive analytic workloads?
Thomas Richter
0:10:33
Yeah, thanks. So it's a very good question, and as you already framed in your question, Postgres itself is extremely versatile, but it usually struggles as the quantity of data grows, and generally it tends to gravitate around four different areas. So the first one is related to query parallelism. Parallelism was kind of added to Postgres when it was already something like 20-odd years old. You're looking at the so-called query parallelism feature being added in version 9.6, and we're now at version 12, so this is only actually a few versions ago. And that means that when Postgres executes, it doesn't utilize modern multi-core systems quite to the degree you would in a data warehousing context, where you're usually working with this MPP, massively parallel processing, paradigm. So Postgres is kind of holding itself back a bit, and it's also missing some of the features to move data during the query process to actually keep the query parallel for a very long time. The second part is that scanning large amounts of data that don't lend themselves to an index is a challenge for Postgres. When you're implementing an index, you're usually solving the kind of problems where you're finding a needle in the haystack, or a few needles in the haystack. Whereas when you have a query that needs to scan effectively a third of your table, indexes don't help you much, and this is really where Postgres will then struggle. The other part is, what if you have many complex queries, and they could be of the first or the second kind I've just been describing, running concurrently? Many concurrent complex queries will really push the limits on your storage bottlenecks, the way you retrieve the data, the way data propagates through Postgres. So that's another area where we are seeing bottlenecks. And finally, my fourth point is that no database is perfect at being able to execute every query in a perfect way. However, there are certain query patterns from the domain of data warehousing that really require a little bit of special processing, a little bit of optimization, for example handling anti-joins differently and things like that. And this is really where there are query patterns that can turn Postgres queries into so-called never-come-back queries. So those are the four areas: query parallelism, the scanning of large amounts of data, many concurrent complex queries, and then certain query patterns that just turn a query into never-come-back. Those are the kinds of things we are seeing. And in general, this is especially relevant when you're moving into hundreds of gigabytes or terabytes or beyond; it tends to be a lot less relevant when you're at ten or 20 gigabytes of total database size.
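As a concrete example of the kind of query pattern that is difficult to process, here is the classic anti-join case in plain SQL, together with the usual hand-rewrite that is often the default remedy. The orders and refunds tables are hypothetical.

```sql
-- Anti-join pattern: orders that have no matching refund.
-- With a nullable subquery column, NOT IN can defeat the planner and degrade
-- into a very slow execution on large tables.
SELECT o.id
FROM orders o
WHERE o.id NOT IN (SELECT r.order_id FROM refunds r);

-- The usual rewrite: NOT EXISTS lets the planner choose a hash anti-join,
-- which scales far better when both inputs are large.
SELECT o.id
FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM refunds r WHERE r.order_id = o.id
);
```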
Tobias Macey
0:13:24
And in terms of the use cases that benefit from this parallelism, the obvious one is the data warehousing use case, where you're being asked to perform large aggregates on data sets, and maybe just for specific columns within a set of rows. But what are some of the other use cases that can benefit from the increased parallelism and some of the hardware optimizations that you're building into Swarm64?
Thomas Richter
0:13:47
Yeah, you've mentioned data warehousing. That is, of course, the obvious one, and in all honesty, data warehousing is a very, very catch-all expression, because there's a very wide variety in how you can design your underlying schema or what kinds of queries you're asking. So that's already a very, very broad field. However, there are also other areas that are quite relevant. For example, anything that has a kind of user-facing dashboarding or reporting element. In other words, you may have BI tools, you may have custom dashboards, you may have a Software as a Service solution that includes some customer interaction; let's take salesforce.com as an example, right? These kinds of applications allow your users to drill down, to aggregate, to find out what their current status is. So in other words, there's a lot of concurrent reporting and dashboarding happening, and these kinds of problems we're able to address very effectively. So it's coming back to the point I mentioned earlier: many, many concurrent complex queries. That's another use case where we see ourselves being very popular and a very good solution. And then another area is what we call new developments, in the area of geospatial data, for example, or machine learning. Just as an example, we did a project with a partner in Japan, and there it was around the subject of connected cars, analyzing geospatial data, and also looking for a certain predictable response time. And we were able to keep that response time window for a much, much longer time than standard Postgres without the Swarm64 acceleration. So if you then translate that back into cost, we actually found that we could get away with much less hardware, and as a result of that you would basically lower your costs by as much as 70%. So that's one area: geospatial data and time series data processing. But again, time series probably in context; we're not trying to be the next TimescaleDB. But we are allowing people to process time series if they have the need of it, in addition to, for example, that geospatial data or the reporting data, etc. And then, as I've mentioned, the other area is machine learning. It's very interesting when you need a certain kind of response speed, that kind of snappiness that Postgres generally has, and combine that with actually feeding in a lot of machine learning data, and at the same time pulling out a lot of data to feed your models. And this is something that we're doing with a company called Turbot; they're in the renewable energy space, optimizing wind turbines for energy generation and how they're actually positioned, and also looking at predictive maintenance cases. So these are just some areas. On the one side, the big data warehousing space, with many, many different use cases in that field, but also things related to dashboarding and reporting, anything in that field. And then of course any new developments: geospatial data with the immensely powerful PostGIS extension, and the machine learning space. Those are some of the areas where people find this very interesting.
Tobias Macey
0:17:19
And then because of the fact that they're able to get this improved performance out of their existing Postgres database, it removes the necessity to do a lot of the data copying and can simplify the overall system design. And I'm wondering, what are some of the other benefits that can be realized by keeping all of your data in the one database engine, but also what are some of the challenges that it poses, particularly in the areas of doing things like data modeling within the database, or, for the data warehousing use case, being able to generate some of the history tables so that you can capture changes over time, and things like that?
Thomas Richter
0:17:54
Yeah. So that's a very good question. Let me first frame the environment a little bit. This is very much thinking in the Postgres world, right? And what I mean to say there is, Postgres is a very schema based database. I mean, Postgres has pretty good document store capabilities, but everything you encounter, the way you work with the database, the schema is really at the heart of it. And things that are maybe schemaless are schemaless for a certain time, and you then use special operators to work with them. And that's a very conscious choice. But in general, if you're comfortable with a world of SQL, with a world where there are defined schemas, then this will be extremely versatile, and you will be able to process certain elements, like for example events in arrays, or certain schemaless elements and documents. That is all possible, but your base assumption should be around a schema based world, and that's something that's quite important. So if you're in an environment where you're willing and comfortable working with explicitly defined schemas, you will find this extremely versatile, and, as I've already mentioned, you'll be able to find solutions for all your different problems: for example, being able to time travel in your data, being able to audit what has happened in terms of changes, and so on. So I would say if you're thinking Postgres, if that's a mindset you like, you will get very far into all sorts of spaces: data warehousing, logging, geospatial data processing, time series data processing, machine learning, those kinds of things. You'll be able to expand into them while staying within your comfortable Postgres working paradigms. I think that is the key qualifier. So if you're happy with SQL, if you're happy with Postgres, this then extends very naturally.
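As a small sketch of the audit and history-table pattern alluded to here, one common way to capture changes over time in plain Postgres is a trigger that writes to a history table. The table, columns, and trigger names below are made up for the example.

```sql
-- Hypothetical base table plus a history table that records every change.
CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    balance numeric NOT NULL
);

CREATE TABLE accounts_history (
    id         bigint,
    balance    numeric,
    op         text        NOT NULL,                 -- 'INSERT' / 'UPDATE' / 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION accounts_audit() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO accounts_history (id, balance, op) VALUES (OLD.id, OLD.balance, TG_OP);
        RETURN OLD;
    ELSE
        INSERT INTO accounts_history (id, balance, op) VALUES (NEW.id, NEW.balance, TG_OP);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION requires PostgreSQL 11+; older versions use EXECUTE PROCEDURE.
CREATE TRIGGER accounts_audit_trg
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION accounts_audit();
```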
Tobias Macey
0:19:59
And increasing the processing throughput of the database can be beneficial for things that are compute intensive, like being able to parallelize the queries. But how does that shift the overall bottlenecks and impact the disk I/O in terms of the overall throughput of the database?
Thomas Richter
0:20:18
Yeah, I mean, the obvious thing that always happens when you're moving to MPP is you run into the I/O bottleneck you've mentioned, right? Suddenly, in many, many queries, the question becomes, how fast can you fetch the data? And one of the things that we did, and we'll probably touch on some of the other things when we talk a little more about architecture, was we created our own storage format. It is a hybrid row-column format, and it has some columnar indexing. The reason we went hybrid row-column is because Postgres itself has a row store, and as I always say, it's very difficult to teach a row store complete column store tricks, right? You'll possibly end up with the worst of both worlds. So we embraced part of the row store concept and built that kind of row-column hybrid format that allows you to still process queries while adhering to the Postgres logic. We are compressing those; Postgres has its own little data pages, and we're keeping a bit bigger data pages that are also compressed. This generally tends to work very, very well. And then there's some columnar indexing, as I've mentioned, to allow us to be a bit smart and not retrieve everything indiscriminately regardless of the query. So you can be a bit selective; some kind of skip reading, or certain range indexes, would probably be the closest comparison there. And all of that is kept in a format that can be processed by the CPU, but this format can equally be processed by our hardware extensions, like for example FPGAs. So we're looking here at two things, FPGAs and smart SSDs, that are capable of reading these formats and then doing a lot of processing along the way. And that usually helps you: it basically resolves the I/O bottleneck with compression, selective fetching, and then processing on additional hardware.
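For reference, the closest built-in Postgres analogue to the skip reading and range indexing described here is a BRIN (block range) index. A rough sketch with a made-up table follows; this is stock Postgres, not Swarm64's own columnar indexing.

```sql
-- BRIN stores min/max summaries per block range, so wide range scans can
-- skip whole chunks of a large, mostly-ordered table.
CREATE INDEX measurements_recorded_at_brin
    ON measurements            -- hypothetical append-mostly table
    USING brin (recorded_at);

-- A query that touches a large slice of the table can still prune block ranges.
EXPLAIN (ANALYZE)
SELECT device_id, avg(value)
FROM measurements
WHERE recorded_at >= now() - interval '120 days'
GROUP BY device_id;
```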
Tobias Macey
0:22:26
So digging further into the actual implementation of Swarm64, you mentioned that it's a plugin for Postgres, but can you talk through some more of the technical details of how you approached that, and some of the evolution that it's gone through since you first began working on the problem?
Thomas Richter
0:22:41
So what's pretty good is that we came into it having already built other database extensions, so we were really looking into, okay, what were the lessons we've learned, and we made the conscious choice to stay at the extension level with Postgres. In other words, we would not go in and build our own Postgres. As many of your listeners probably know, a lot of the popular projects and products in the market are actually Postgres derivatives. You have the example of Amazon Redshift, you have the example of IBM Netezza, or, for example, Pivotal Greenplum, which were all once upon a time versions of Postgres that were then taken private and forked into new projects. And we decided not to do that. So we started by looking at, okay, where are the extension hooks that we can use, where are certain APIs that we can use, and we started expanding from there. And Postgres is very, very versatile in that space. It's probably among the most extensible databases there are, including closed source databases; among both open and closed source databases, Postgres is probably among the most extensible. And what you can do is you can define certain ways in which your data is accessed, for example a custom scan provider, and you can define ways in which your data is stored. We started with foreign tables, the foreign data storage engines, because there was no native storage engine API yet at the point we started; that now exists in version 12. We are very eager to see how this table storage API will evolve in the future, and we may actually go much more in that direction. But for now, it's really a combination of defining certain table sources, in our case through that foreign table API, combined with certain access paths we can define, certain query planner hooks we can provide to Postgres, and certain cost functions. So it's really been designed very, very well in terms of extensibility, and you can just offer yourselves to all these different extension hooks, and then your respective functions will be called, and you have the ability to tell box-standard Postgres about all the great things you can do in addition. And this is how we worked. And we realized that it's not the easiest way of working, but it's in a way the most rewarding, because on the one side you're really benefiting from the lower effort and overhead to move between Postgres versions, and secondly, it was actually very easy for us to support other solutions. For example, our product also works with EnterpriseDB, and EnterpriseDB's Postgres Advanced Server is actually not open source. Still, we were able to compile for Postgres Advanced Server by EnterpriseDB and run on it. So now you can also use our product in a solution like EnterpriseDB, and that would not have been the case if we hadn't gone for the kind of modular, pluggable architecture that Postgres was offering us. Now that is how we work into the system. Let me just cover a few parts of what we're actually doing. On the one side, if we take the anatomy of a query, it goes into the system, and we are basically then offering Postgres, in addition to all the different data handling mechanisms it has itself, additional ways to process the query. So for example, we offer it the ability to move data around during the query, so-called shuffling, so the query can stay parallel for longer.
That's one of the things we do. We offer Postgres our own join implementation, specifically optimized for joining very large amounts of data. So if you want to join tables that have a few billion rows with tables that have a few million, or even a few billion rows themselves, that is something that can very quickly bring Postgres to its limits, and what we did is build a special join implementation for that. So that's something that is offered to Postgres, and it can pick it if it wants to. We offer certain query rewriting patterns. So if we notice that something is going to be executed very badly because, for example, it's a very linear execution mechanism when you could do it in parallel, then we will offer that to Postgres, and the Postgres query planner will then pick and choose. Once the query is planned and gets executed, we have the matching executor nodes for all these things, and we also have this accelerated I/O I was mentioning before. And when it comes to processing, we can sometimes offload the entire query to the hardware accelerators. So there are optional hardware accelerators: you can use FPGAs, you can use Samsung smart SSDs, and those FPGAs from Intel or Xilinx, or smart SSDs from Samsung, will then receive instructions and process data according to the query and only return the results. So all in all, there is a host of different functions we are offering to Postgres; the query planner will choose, like from a buffet. And if you have the optional additional hardware acceleration, it will also offload and push down a lot of the query processing directly to the additional hardware, making your system thereby even more efficient.
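To make the foreign table machinery concrete, here is what that extension surface looks like with the file_fdw wrapper that ships in Postgres contrib. This is only a generic illustration of the API family being described, not Swarm64's own DDL, which isn't shown in the episode; the server, table, and file names are made up.

```sql
-- A contrib foreign data wrapper; any FDW-based extension attaches through
-- the same CREATE SERVER / CREATE FOREIGN TABLE surface.
CREATE EXTENSION file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE events_csv (
    event_id   bigint,
    event_type text,
    created_at timestamptz
) SERVER csv_files
  OPTIONS (filename '/tmp/events.csv', format 'csv', header 'true');

-- From here on it is queried like any other table; the wrapper supplies scan
-- paths and cost estimates to the planner, which is the same hook surface an
-- accelerator extension can plug into.
SELECT event_type, count(*) FROM events_csv GROUP BY event_type;
```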
Tobias Macey
0:28:17
And then another element of this equation beyond Postgres is the available hardware. So you mentioned FPGAs and smart SSDs, and I know from looking through the documentation that you also have support for the Intel Optane persistent memory. And I'm curious how the overall landscape of hardware availability has evolved since you first began working on this, and some of the challenges that things like the cloud pose for people who are interested in being able to leverage the specialized hardware that you are able to take advantage of.
Thomas Richter
0:28:49
Yeah, that's a very good question, and that's also something where I'm happy that the market has really moved in, from our opinion, the right direction. Because when we started with this, as I mentioned earlier, we came from a very, very hardware driven world, and we were very early on using these FPGAs, first as a prototyping platform for processing data and for database processing. Then many changes happened in the market. On the one side, Intel has really been at the forefront of moving FPGA devices into the data center, and then Xilinx also followed. Then Amazon, already years ago, introduced an FPGA based instance into their cloud, and from there onwards it's really been step by step by step, and more and more clouds are enabling data center grade FPGA accelerator cards. In terms of cloud support in the context of Swarm64, it now becomes too many to mention everyone who's supporting FPGAs, but let me just mention the ones that we directly support. So of course you've got Amazon, you've got OVH, the large French data center, you've got Nimbix, a US based high performance data center, and it's public knowledge that Azure is coming out with an FPGA instance. So those are just some of the players in the market that we are focusing on at the moment, and there you can really get access to FPGAs in the cloud quite easily as an instance type. And it's ever easier to deploy them on premises; those can be obtained through OEMs. They're basically extension cards. They look like GPUs, more or less, just with a very, very different profile of what's inside. If you're just looking at the PCIe card, it looks more or less like a GPU, so nothing new and exciting outside of the box. But of course it gets quite exciting when you look inside.
Tobias Macey
0:30:51
And then another area of complexity is, because of the fact that you are acting as an extension to Postgres, you need to be able to support whatever different versions people are running in their installations. So while there might be a new feature that simplifies your work in version 12, as you mentioned, the table storage API, you still need to be backwards compatible to whatever Postgres is supporting in order to be able to take advantage of a wider range of adoption. And so I'm wondering how the overall evolution of Postgres has impacted the product direction and the engineering work necessary on your end to be able to build in all the features that you're trying to support, as well as the challenges that you're facing in terms of being able to support the range of versions that are available for people who are running Postgres.
Thomas Richter
0:31:39
So in general, sometimes people are directly plugging us into the existing database, but in general we're proposing a one-time backup and one-time restore, quite simply because when we are deploying to our clients, we usually give them a container based deployment. I know there may be some people that are religious about the teeny tiny bit of performance a containerized approach might cost, but just in terms of ease of deployment, it makes it so incredibly easy that in the predominant amount of cases we actually manage to convince the client to do it this way. And to be honest, 80 to 90% of the clients are already very, very happy with just going with a container. So when you're actually getting Swarm64, you will be getting a kind of matched set: it will be a box-standard Postgres, but of the right version that we recommend at that moment, combined with our extension and combined with all the relevant settings you need, in a container. And then if you're using an FPGA or Optane persistent memory on your system, it will also have all the right configuration parameters to make it really, really easy to deploy to that hardware. So you're getting almost cloud-like comfort there. And we're basically, by the way, doing the same with all our machine images for the different cloud instances I've been mentioning. So we really think it's much more convenient to do a one-time backup and restore, and then not fight with any configuration parameters or any details, than actually trying to retrofit into every single Postgres version that is out there. However, having said that, we will also make the deployment into more broadly available Postgres versions easier, so that maybe half a year down the line there will be a way you can extremely quickly just install the extension into something from probably Postgres 10 or 11 onwards to 12 or 13, so a fairly broad window of versions that we will just support out of the box. And just to pick up the detail in your question there about the Postgres storage engine: we're at the moment not utilizing the storage engine, because we are actually waiting to see how it will evolve. But you're right, once we've actually made that pivot from the foreign data wrapper to the storage engine, that will be forcing us to keep two versions maintained, so depending on which Postgres version you're on, you would basically use us in one way or the other. So that comment is true, but in general we've so far, knock on wood, been quite successful in keeping pace with Postgres.
Tobias Macey
0:34:12
And then the other element of compatibility is in the other extensions that people want to be able to use alongside your work at Swarm64. So I know you already mentioned PostGIS, which is one of the better known extensions in the ecosystem, but what are some of the other ones that people will commonly look to use alongside Swarm64? And what are some that you know to be conflicting, that won't work if they're using your extension?
Thomas Richter
0:34:38
Yeah, let me try to answer that question at a bit of a high level. So in general, people love using extensions, and something that's extremely popular is not only PostGIS, but also any kind of extended data types, like custom data types and so on, which is really one of the strongholds of Postgres. Those are of course important to support, and that's something we do, and it is very, very useful. So I would say custom data types and that kind of custom functionality around Postgres extensibility is really what we see most. Now, what does not work with Swarm64 in the current version? And there's a change coming, which I will just tease a little bit. In the current version, as I mentioned, we're keeping our own storage of the data, so anything that relies on how Postgres data is stored will conflict, or at least require a workaround. So what we generally advocate is for people to use a mix of what we call native tables, the ones that do not have this columnar storage and are standard Postgres tables, and some of those accelerated tables, and to mix and match them, horses for courses. Now, when you use a solution like, for example, a background backup tool that invisibly just copies pages, that is usually relying on some knowledge about how Postgres data looks, and hence it will run into trouble when trying to work with Swarm64 data. Similarly for replication schemes that are based on, for example, how the data is stored on disk; again, a similar issue. However, we have recognized that for customers it is sometimes actually quite useful to be able to just retain the data exactly as they store it. And so in an upcoming product version we will be looking more into what we call the complete drop-in, where people have more of a choice. They still have the ability to get the extreme acceleration for certain kinds of data; maybe these are the kind of append-only data we were talking about earlier, history tables and things like that, which would fit perfectly into our format. However, you may have other data that is perhaps replicated between multiple Postgres databases, etc., where you would choose a different storage format. And this is really where the upcoming product versions will go: they will allow you to keep your source format for the cases where it makes complete sense and still give you a good amount of acceleration, as well as use the bespoke analytical storage format for the cases where you want extreme performance.
Tobias Macey
0:37:08
and digging more into the FPGA capabilities. I know that most people are going to be familiar with the concepts of the CPU and the GPU as a coprocessor. And some of the relevant statistics of those different pieces of hardware for being able to select one that will fit well. But for people who aren't familiar with FPGAs, or who haven't worked with them closely, what are some of the benefits that an FPGA can provide, particularly in the database context? And what are some of the specifications that they should be looking at when they're selecting one for installing into their hardware or for deploying into a cloud environment?
Thomas Richter
0:37:46
So let me take a quick look at what an FPGA actually is. It's a configurable fabric. I often say it's like a blank sheet of paper that, when it wakes up, is told what it should be; in one case that could be a piece of sheet music for a symphony or something. And it's quite similar here: it's this blank sheet of paper being configured to be the processing logic you need. How this translates to the area of databases is that we turn it into a piece of processing logic that processes the individual data points in your data storage as they move through the chip. So as storage is moved through the FPGA, we've turned the entire fabric, the entire FPGA, into custom logic for our database processing. Now, some people ask us, do you compile every query into a specific configuration for that FPGA, so it does only that? No, we don't. We actually instead use the FPGA with a very SQL specific, but still quite versatile, processing unit that does all the processing: as in the compression, if you're looking at storing data, you compress and finalize with the FPGA, or if you're looking at reading data, you decompress and then execute the SQL query. And all of that happens while data is flowing through the FPGA; you have fantastically fine grained control over how your data moves. I would say this is probably the single biggest difference between CPUs and GPUs on the one side and FPGAs on the other: because you have that ability to reconfigure, you can make something very custom, and because you can make it custom, you can make sure data moves efficiently. I enjoy a little bit of GPU programming as a hobby, and the challenges you have in making sure that your processing happens effectively, the kind of knowledge you need about all your cache hierarchies and how data moves, that's something you do not need to consider in the FPGA context, because it's all determined by yourself. You are actually defining how data moves, and hence you can make it extremely effective and extremely efficient. So I would say that's one of the core elements. And then finally, another very interesting element is that this is reconfigurable within split seconds. So one of the processing units we have, for example, is a streaming text processor that is capable of finding wildcard based strings, like strings with fuzzy matching, inside your data as it moves through. Very effective, but as you can imagine, that takes a little bit of space. So the FPGA being reconfigurable, you could, depending on your workload, have those units included or excluded, and vice versa. If you have, for example, a nightly or weekly load window, you could reconfigure your FPGA, and that's basically something that our database does in the background: reconfigure the FPGA to be all writers, then it does the nightly load, and then it turns back into all readers to do the daily query processing. So those are examples of what makes the FPGA kind of unique versus CPUs and GPUs: on the one hand, the ability to really define how your data flows, and on the other hand, the ability to reconfigure, so you can actually shape shift your device to match your need.
Tobias Macey
0:41:14
And then for people who are adopting Swarm64, and particularly the hybrid row-column store, how does that impact the way that they should be thinking about data modeling and the table layout, and also, for people who are working with very large data sets, any partitioning or sharding considerations that they might have?
Thomas Richter
0:41:35
It stays pretty close to Postgres itself. In general, you can partition our data, but quite often it's not needed, because often partitioning is a requirement to overcome certain performance bottlenecks, and we don't necessarily require that. However, it is entirely possible to do it. So in terms of paradigms, there's nothing new to learn. It's really quite standard Postgres; it's just another storage format you can choose, where you can get additional benefit. Having said all that, this storage is really kind of an expert option; it is there to get the best possible performance. When people use our product, they will already get a benefit from all the other features, and the additional storage is kind of the icing on the cake. So what we generally recommend to our customers is start slowly and work your way into it. We don't advocate any big migrations or any big changes, particularly when you're coming from Postgres. There are usually a few small tweaks you can do, and you will see dramatic differences.
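Since partitioning comes up here, a quick sketch of standard declarative partitioning in Postgres 10 and later, with made-up table and column names, just to show that the paradigm is unchanged.

```sql
-- Range partitioning by timestamp; queries and DML target the parent table.
CREATE TABLE sensor_readings (
    reading_id  bigserial,
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE sensor_readings_2020_q1 PARTITION OF sensor_readings
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

CREATE TABLE sensor_readings_2020_q2 PARTITION OF sensor_readings
    FOR VALUES FROM ('2020-04-01') TO ('2020-07-01');
```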
Tobias Macey
0:42:35
In terms of your experience of building this product and growing the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned?
Thomas Richter
0:42:45
One of the things I found really, really interesting is to see how our customers, and also how our excellent solution engineering team, have actually solved some of these things. It shows you the boundaries of your product, and it being used in ways it wasn't really intended to be used, which was really fun to see. So let me give you one specific example. There was an issue around query processing speed, and that was actually already a year and a half ago, so it was a much earlier version of the product. But essentially the way the customer and one of our solution engineers got around the problem was that they actually turned everything into a Swarm64-based table format, like I've been describing earlier. It started from a transactional table, but it was so cheap to make that secondary copy, because everything was very, very fast, ingested and, you know, accelerated by the FPGA, and it was also very, very cheap to process once it was in that format, that the entire round trip was still significantly accelerating the query. That was really unexpected. If you think about it: okay, I have a table and I'm doing a very heavy operation on it. And then, no, wait a second, you actually take a copy of the table, run the operation on the copy of the table, and you're still faster than processing the original table in the first place. That was quite fascinating to see happen. So that was really a learning point. Now, in a way that was also a little quirky, so with our new versions we would now recommend a different design. But you know, at the end of the day, it's really, really fun seeing your product being used and seeing some of the performance benefits being put to quite unexpected uses.
Tobias Macey
0:44:33
And when is Swarm64 the wrong choice, where somebody might be better suited either just using vanilla Postgres or some of the other plugins in the ecosystem, or migrating to a different set of database technologies?
Thomas Richter
0:44:47
So generally, if your problem is small, or your system is small, I think we're probably not the right choice. For example, some people say, oh, I'm running a big database server, and what they really mean is they have four or eight physical cores and then eight or 16 threads. And this is really the kind of level where added parallelism becomes a little pointless, because there aren't that many cores to go around in the first place that you could parallelize over. Similarly, Postgres itself performs pretty well, even with these kinds of challenging industry benchmarks, if you're in areas like 10 gigabytes or 30 gigabytes of data. So I wouldn't say that for anything in that range Swarm64 would really be relevant, but it can sometimes already be relevant at 100 gigabytes of data. And then as you move up from there, into terabytes, into tens of terabytes, hundreds of terabytes, that's definitely a range where we feel very, very comfortable. So too small a system or too small a problem, or often the combination of the two, that's something that's definitely not so suited for us. And then the other part, as I've mentioned, is that we're not trying to introduce a new tool or invent any new paradigms. So you should be looking at Postgres, as in, you may be using Postgres already, or maybe considering Postgres. I think this is also a kind of qualifying criterion, so to say. If you really want to work on something that is, for example, NoSQL style, you shouldn't be looking at a Postgres extension, right? So that is, I think, another point. However, it doesn't mean you have to be on Postgres already; we find people who are looking at Postgres coming from those proprietary data warehouses we were talking about in the beginning, and for those we can actually be an excellent choice.
Tobias Macey
0:46:44
And then as you look to the near and medium term of the business and the technologies and the evolution of the Postgres ecosystem, what are some of the things that you have planned?
Thomas Richter
0:46:55
So in general, this notion of becoming more and more invisible, I would say that's kind of the overarching concept. So if I imagine where we are a year from now: you start with a server, you add some Optane DC, and you press install extension, and Swarm64 is using the Optane persistent RAM from Intel and will just be doing everything invisibly; you will get the acceleration. The same with an FPGA card: maybe on a cloud instance you choose, okay, I want to use a certain FPGA enabled cloud instance, or I'm installing an FPGA card into my server, or I'm buying a new server that already has an FPGA card, say from Intel or Xilinx, or a smart SSD, or an array of smart SSD drives from Samsung. And then all you do is install the extension; we detect the hardware and we adjust our parameters. That's really where I see ourselves going in the future. We've been able to show very, very good performance with the product we have now; it can be dramatically different, like 50x on some queries, and usually you see 10 to 20x depending on your workload. So a big, big acceleration, which is great, but we now want to make it easier and easier to use, and that's really where I see ourselves going. So you're adding hardware, you're ordering a server that has new hardware, and then using our extension you'll be able to use it very effectively, and it will all fall into place behind the scenes. We've got some pretty promising prototypes of that running in our lab, so I'm very confident we'll go that way, and we will become more and more invisible, apart from, of course, the massive performance differences that we want to make for our users.
Tobias Macey
0:48:47
Are there any other aspects of the work that you're doing at Swarm64, or the Postgres ecosystem, or some of the analytical use cases that we've highlighted, that we didn't discuss yet that you'd like to cover before we close out the show?
Thomas Richter
0:48:59
Well, one thing I want to mention is a big shout out to the community: we've managed to get our first patch through, and it has now been pushed, which was great. This is about making it easier to back up foreign tables as well in the Postgres environment. So that will go into one of the new upstream versions of Postgres; it should be there in version 13. So a big shout out to the community for that. And in general, we see ourselves as a member of that community, so we are looking all the time at, okay, what can be contributed. We're also looking very much into the initiatives of the community around this persistent RAM, Optane DC, and of course FPGAs and accelerators. So a big shout out to all the companies there in the Postgres ecosystem; it makes it a lot of fun to be there, because you've got so much support for this database.
Tobias Macey
0:49:50
So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll definitely have you add your contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Thomas Richter
0:50:07
Okay, I would say, actually, a really, really powerful open source visual BI tool that interacts with these different databases. I think that is something that could be quite transformative. So think about an open source Tableau, with that kind of power and capabilities. I don't want to discount any of the projects that are out there, but I think there's definitely room for one of those existing projects to grow into a really feature rich and easy to use visualizer that just connects to different database back ends and then just runs. So maybe I'm overlooking something obvious, but of all the tools we've been using at Swarm64, all the open source ones, we didn't find something that is quite as powerful as some of the proprietary offerings out there. So that may just be something that could be quite transformative in getting people to think more about data management and to utilize the database, in the context of those tools, to a greater degree than they are using it today.
Tobias Macey
0:51:17
Yeah, I can definitely agree with that, that there are a bunch of great point solutions or great systems that have a lot of features, but aren't necessarily very accessible to people who don't want to dig into a lot of custom development for them. So I'll second your point on that. So thank you very much for taking the time today to join me and discuss the work that you've been doing with Swarm64 and trying to optimize the capabilities of Postgres and simplify people's use cases there. It's definitely a very interesting project, so I thank you for the work you're doing, and I hope you enjoy the rest of your day.
Thomas Richter
0:51:49
Thank you very much. It's great to talk to you.
Tobias Macey
0:51:57
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!