Open Source Object Storage For All Of Your Data - Episode 99

Summary

Object storage is quickly becoming the unifying layer for data-intensive applications and analytics. Modern, cloud-oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de facto API for interacting with this class of service, so the team at MinIO have built a production-grade, easy-to-manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you explain what MinIO is and its origin story?
  • What are some of the main use cases that MinIO enables?
  • How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?
    • Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)
  • What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?
    • What are the constraints and opportunities that are provided by adhering to that API?
  • Can you describe how MinIO is implemented and the overall system design?
    • How has that design evolved since you first began working on it?
      • What assumptions did you have at the outset and how have they been challenged or updated?
  • What are the axes for scaling that MinIO provides and how does it handle clustering?
    • Where does it fall on the axes of availability and consistency in the CAP theorem?
  • One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume?
  • For someone who is interested in running MinIO, what is involved in deploying and maintaining an installation of it?
  • What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?
  • How do you approach project governance and sustainability?
  • What are some of the most interesting/innovative/unexpected ways that you have seen MinIO used?
  • What do you have planned for the future of MinIO?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system. So can you start by introducing yourself?
Anand Babu Periasamy
0:01:50
Hi, this is Anand Babu Periasamy, one of the co-founders of MinIO.
Tobias Macey
0:01:57
And do you remember how you first got involved in the area of data management?
Anand Babu Periasamy
0:02:00
It was kind of accidental; I did not start out this way. It happened in my previous startup, when we did the Gluster project. The name Gluster came from GNU cluster; it was supposed to be a shared-memory distributed operating system. And within like six months of the project, our customers started telling us that their problem was not compute. They had petabytes of data sitting on tapes, and they wanted to move them to drives to do large-scale simulation. I actually could not find a good file system that I could adopt, so we built a file system from scratch. That's how I fell into this market. And once you've come into data management, you cannot leave, right? So I stayed on.
Tobias Macey
0:02:54
It's funny how many times I've had that conversation where the actual product that somebody ends up offering had absolutely nothing to do with the original intent of their business or their project. It was just because they happened to build something useful that everybody else then started asking for, and they decided to turn that into their actual product and company. The most notable example that comes to mind is TimescaleDB.
Anand Babu Periasamy
0:03:15
Yeah. In fact, no one knows the company name behind Gluster; it was called Z Research. At some point the company became very popular because of GlusterFS, so we killed every other initiative and changed the company name to Gluster so people would recognize us, and GlusterFS, being open source, became the file system people knew. You know, if you build a good product, it gets noticed. That's how it happens a lot of the time.
Tobias Macey
0:03:45
And so now you can't seem to escape from file systems, and you have become the co-founder of MinIO, which is a different type of storage system. So I'm wondering if you can just start by explaining a bit about what that product is, and some of the origin story behind the project and the business this time.
Anand Babu Periasamy
0:04:02
Well, while MinIO is an object storage, most people think that I am doing this because I became a storage guy, because of my past experience with the file system. That's actually not the real story. I was at Red Hat for a while after the acquisition to make sure the transition went smoothly and they were able to take care of things without me. But what actually happened was, I was thinking about doing something more fun. Me and a small team were working on bionics, like the ability to see through your skin; you can close your eyes and see things. I wanted to do something more fun. But then when it came to actually doing a startup, the moment you have VCs involved, you want to do something that makes an impact in the next six months, the next twelve months; you want to show progress, right? And I went back and asked a very simple question, and object storage was an accident of that. The real question I asked was: ten years from now, what problem, if you pick it today, will still stay relevant? Not only stay relevant, it has to compound and grow. And don't pick something that is trendy and short term. Everywhere I looked, the direction was simply that the world will produce more and more data. And it was never about storage, right? GlusterFS was about building a storage system. This time around, what I saw was that the problem is about data, and data will be the heart of every modern enterprise, and you can build a powerful brand of trust around it. And then I saw that everyone was struggling with just managing vast amounts of data; the closest thing they had was HDFS and some distributed file systems like Lustre. And in 2008 it became quite clear, even while I was doing Gluster, that Amazon would convince the rest of the world that if you are willing to let go of all the legacy interfaces like file and block, you can build a significantly better data storage system. And I knew that Amazon would convince the rest of the world. So I saw that building an object storage was a good starting point. My simple thesis was that if the data sits on a technology that we built, we can do many powerful things on top, and a good starting point would be to build an object storage. Everyone else thought object storage was a hard problem to go attack, and I thought that object storage is simply a distributed set of web servers, right? And we started out with object storage, the idea being: if you are inside AWS, you should use Amazon S3, but if you are outside AWS, what choice have you got? That's the part that MinIO should address. My calculation, and it's also a bet, right, is that the world will produce so much data; what percentage of that world's data will be sitting on Amazon S3, compared to the rest? I thought that the bulk of the world's data would be outside of Amazon S3, and I wanted to go get the rest of the world with a powerful, open source alternative to Amazon S3. And it turned out to be great; in the last four years, it really picked up widely.
Tobias Macey
0:07:26
So in the object storage space, there are a number of different contenders, the most notable being S3, as you said, but there are also products from Google and Azure, and then there are also some other open source offerings, most notable being the Swift object store, and the object gateway that is built on top of Ceph. And I'm curious if you can just characterize the overall landscape of object storage, both now and at the time when you were first trying to tackle this problem, and some of the main use cases that you see for object storage and how that has evolved over the past few years.
Anand Babu Periasamy
0:07:58
There are plenty of choices, and that is a good thing. But if MinIO were just another alternative to one of those systems, if it's like KDE or GNOME, or Emacs and vi, there is really no good reason to just add another flavor. This is a market where users want to standardize on the one system that is most proven, because imagine, if there are many choices, you pick one of them because somehow it came in your flavor, and then tomorrow the project is not well maintained; your data is going to be held hostage, right? The real reason why we had to do Gluster back then, and MinIO this time, is that I actually could not put my own data in any system that was out there. Back when I built GlusterFS, every other system was so complicated with metadata management; if it got corrupted, you were just out of luck. Complex systems don't scale. And this time around, when it came to object storage, Amazon showed something very powerful: if you strip down everything and reduce it to basic GET, PUT, LIST type atomic, immutable operations, you can significantly simplify the whole object storage implementation. And everywhere else I looked, either they built a block store or a file system or a unified storage system and added an S3 object API gateway, or they are a SaaS product, right? Like Google Cloud and others are SaaS products, and they are incompatible with Amazon S3. What we wanted was something that is fully S3 compatible and is built for the object API from scratch, something that does only one thing really, really well. Even take SwiftStack, for example, and the Swift API; you know the debate around OpenStack itself, right? Kubernetes took off, and OpenStack has not made any dent in terms of replacing Amazon for the rest of the private cloud. And as for Swift, when I started, I asked the community to pick between the Swift API and the S3 API, pick one, and it was very clear the industry wanted the S3 API. And it's not just the API itself. If you look underneath most of these object storage systems, either they are a file or block store with an object API gateway on top, or they have built a file-system-like storage system and then added an object API gateway. MinIO is a very different breed. From the ground up, it is designed to be a single-layer object storage system. I'll go into the details later, but by its architecture and implementation it's a very different system. And how does that matter to the end user? It is really about a very simple system that can do very powerful things. So why is it important today? I found that the data is only growing bigger and bigger, and every enterprise today is struggling with the data management problem, and the performance and scale requirements have grown manyfold, such that traditional systems don't scale to handle these kinds of workloads in terms of performance and durability at scale. If you have tried any one of those systems, just installing them itself requires significant expertise; then think about operating them at scale.
Tobias Macey
0:11:49
Another thing with Ceph is that, as you said, it has this API gateway for being able to provide the object interface, and at the foundational layer it's essentially a distributed key-value store for storing the bits and bytes of the files. But there still isn't any compatibility layer between the object store and the POSIX interface for being able to interact with the same files in a different format. So you don't really get any real benefit from running that system if all you really care about is the object interface, and because it's adding additional abstractions, you're adding additional overhead as well. And I know that one of the factors that you're focusing on for MinIO in terms of positioning and feature set is the speed, as well as the S3 compatibility, which I know is sometimes lacking with Swift. I also noticed that a lot of your positioning is based around the use of MinIO and object storage for use cases around machine learning and artificial intelligence workloads, and I'm curious how you have seen the overall market for object storage for those types of use cases, as compared to a distributed POSIX interface.
Anand Babu Periasamy
0:13:02
So that use case part evolved organically, as we watched the market in terms of the community, how they grew, and all the things they did with it. The AI/ML part, if you track the history of the project, you would have seen that happened only in the last one or two years. When we looked at the use cases, it was all over the map, and as open source, you can't really control what all they use it for. But what we found was that the bulk of the enterprise data was actually sitting in HDFS, and very little on scale-out NAS or in the public cloud. The shift already happened, right? Every major database, analytics, and machine learning system, if you look at them, from Google BigQuery to Azure ML and Power BI, to SageMaker, EMR, and Athena, everything you look at is built on object storage. The private cloud is still sitting on HDFS, and users are struggling with managing those vast amounts of data and the complexity of HDFS and Hadoop. They want their on-prem infrastructure to reflect how AWS is built. This is where Kubernetes took the computing side, and they want object storage to be the data management side. And this brought the AI/ML and big data applications to MinIO. And for us, the performance was there; other object storage systems were eventually consistent, and the performance was not as good. Most of them are built for archival needs, right? Sure, you can put MinIO on a hardware-based system and use it for archival; it will be cheaper and faster even for archival needs. But where it really differentiated itself compared to other projects out there was performance and business-critical needs, and the simplicity is what the users liked. And we naturally gravitated towards that use case.
Tobias Macey
0:15:07
And then, in terms of the S3 API compatibility, that imposes a certain number of constraints, where you are committing yourself to ensuring that you have this interface for the object storage. And I'm wondering what challenges you have faced in terms of ensuring the completeness of adherence to their API, particularly as it has changed and evolved, most notably with the S3 Select capability. And then also, given that you are accepting these constraints, what have you found to be the areas of innovation for the product?
Anand Babu Periasamy
0:15:42
So, the good and bad parts of the S3 API? I would say it's mostly good. The only bad part is that it is not an open-standard API, right? It is a standard simply because it's the most popular implementation. And I don't regret that, because it is true for pretty much any system out there, whether it's your Apple charger or pretty much any device, any standard: if you are the most popular implementation, you automatically become the standard. And it's something that we don't control, and the industry experts don't control. But I think it's quite alright that somebody opinionated, like Amazon, is driving the direction and controlling it instead of a consortium; I'm actually fine with that. Now, the details are where it matters to us. If you look at the AWS REST API spec, the spec is just like a guideline, and how exactly the API is implemented is quite nuanced. The problem we found was that if you took different SDKs and different tools, even ones developed by Amazon itself, like different versions of the AWS Java SDK, and compared from old to new, you would see that the implementation, the interpretation of the REST API spec, is quite different. I wouldn't say quite, maybe slightly different. But different SDKs, different language bindings, actually have different interpretations too. And you will also see open source tools that sometimes even have bugs, and the API on the server side needs to be forgiving. Amazon's implementation of the S3 API, the Amazon S3 service that is, is very forgiving when it comes to a variety of applications, old and new, hitting the server. Now, when it comes to us mimicking exact compatibility, the challenge we have is that it only takes one API call to break. And it's not just one API; the APIs themselves have different variations, and sometimes even bugs, right? We have to make sure that every single detail is captured. And we found that the only way you can get to that level of granularity and correctness is to be focused. This is where we decided very early on to do one thing really, really well, and once we decided that it is the S3 API, we never turned back, right? And within the S3 API, if you look at the S3 v4 API, we were the first to implement it; everybody else either copied our code or just copied our product itself into their product offering. And since then, anything Amazon introduces, we will have it right away. Like we mentioned, S3 Select, for example: only Amazon has S3 Select, and we have it. If anybody else has S3 Select, it is simply MinIO inside their product. Being focused is what lets us keep up with Amazon. When Amazon introduced S3 Select, we implemented S3 Select alongside. In fact, before S3 Select, we had a more powerful implementation inside MinIO: you could upload JavaScript as part of your REST call, the JavaScript got executed on the data, and the output of the JavaScript function would be the output of the object. But anytime Amazon innovates, we want to be closer and compatible with Amazon, so we constantly remove features and rewrite to be compatible with Amazon. It's only a good thing for the users, because if I lock somebody into MinIO in the short term, it's bad for the user in the long term, and that is bad for us too. So we stay very close to the Amazon S3 API.
And everyone claims that they are S3 compatible, but the details are where it matters. For us, given the scale of adoption, we are able to keep up with the S3 compatibility.
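
To make that compatibility claim concrete, here is a minimal sketch of what it looks like from the application side: the stock AWS SDK for Go, pointed at a MinIO endpoint instead of Amazon. The endpoint, credentials, and bucket contents below are hypothetical placeholders, not values from the conversation.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// The only MinIO-specific part is the endpoint and credentials;
	// the SDK, the calls, and the wire protocol are unchanged from S3.
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://localhost:9000"), // MinIO server, not AWS
		Region:           aws.String("us-east-1"),
		Credentials:      credentials.NewStaticCredentials("minio-access-key", "minio-secret-key", ""),
		S3ForcePathStyle: aws.Bool(true), // self-hosted deployments typically use path-style URLs
	})
	if err != nil {
		log.Fatal(err)
	}

	svc := s3.New(sess)
	out, err := svc.ListBuckets(&s3.ListBucketsInput{})
	if err != nil {
		log.Fatal(err)
	}
	for _, b := range out.Buckets {
		fmt.Println(*b.Name)
	}
}
```

The same program runs unmodified against Amazon S3 by dropping the custom endpoint, which is the portability property discussed above.
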
Tobias Macey
0:20:10
Continuing on the idea of data lock-in and providing a consistent interface to make it easier for people to move their workloads between different environments: I was also noticing the federation capabilities that you've built into MinIO, both in terms of federating between different clusters of MinIO itself, but also in terms of providing an API compatibility layer for placing in front of things such as Google Cloud Storage. And I'm wondering if you can just talk through some of the strategies and implementation that go into that overall idea of federation, and providing those compatibility layers to make it easier for people to migrate their workloads while keeping their client code the same.
Anand Babu Periasamy
0:20:54
Yeah, so the compatibility part, the API gateway, was relatively easy for us to do. Because of the design that we adopted inside, erasure coding, bitrot protection, all those capabilities were at an object-level granularity, all we had to do was make the backend module pluggable, so we could write other storage adapters and make everybody else look Amazon S3 compatible. But the story behind how it happened is the interesting piece. Microsoft was actually the first to ask for this feature. Microsoft came to us and asked if we could make their blob store S3 compatible. And I showed them the data on how many MinIO instances were running inside Azure; I think Azure was the third most popular deployment base for MinIO, and second is Google Cloud. In fact, the number one deployment base for MinIO is actually Amazon, AWS EC2 and EBS. Yeah, there are like 600,000 plus, maybe 700,000 plus, unique IPs of MinIO running inside Amazon itself. Now, when people are running inside a public cloud and they want S3 compatibility, they just run MinIO on that cloud. And what Microsoft wanted was something different. MinIO running on Azure was on file and block shares, and that was not very useful for them. The reason being, if you uploaded data to MinIO running on Azure on a file or block share, while you can read and write through the S3 API, you can't run Azure ML or Power BI or any other Azure cloud service on that data, because those services don't speak to your share. What was interesting to us was, they told us those services can't even read the data sitting on their file and block offerings; they are all built on object storage. And what Microsoft wanted us to do was to make MinIO store the data on top of the Azure Blob store. And for me it was like, wait a minute, right? What you're asking me is to put object storage on top of object storage. Why would I do that? And they explained to me that all the other cloud services inside Azure only speak the Blob API. The second part was, they treated file and block as legacy, and us storing the user data on file and block was, to them, an enterprise legacy compatibility layer. And it made sense to us. Besides, we also saw that making every other system that is not S3 compatible become S3 compatible is a very important problem for us, right? Today, the biggest problem that the industry is facing is that applications are still speaking legacy APIs; they need to be rewritten to be cloud compatible, and every cloud having a different API is not helping. If we made everybody look like the S3 API, more applications would be speaking the S3 API, and it would enable migration, whether to Kubernetes or MinIO or the public cloud, or you can even stay with your existing legacy systems. Say your NAS appliances and systems, we can make even them look like S3; even HDFS can look like S3; Backblaze, all the way to Google Cloud, we made everybody look like S3. And what that did to the industry was make a huge number of applications adopt the S3 API as a standard. While on one side Amazon helped tremendously, we actually helped the private cloud market adopt S3. Now there are more applications.
And in most cases they have been built on MinIO, and when they go to production, they just carry us along with them.
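
The pluggable-adapter idea he describes can be pictured as a small object-level contract that every backend target implements. The interface below is an illustrative sketch under that assumption; the names are hypothetical and are not MinIO's actual internal types.

```go
package gateway

import (
	"context"
	"io"
)

// ObjectBackend is the object-level contract each storage adapter implements.
// The S3 front end only ever talks to this interface, so adding a new backend
// (Azure Blob, GCS, NFS, HDFS, ...) means writing one more adapter. Each
// adapter stores data in the backend's *native* format, which is why objects
// written through the gateway remain readable by that cloud's own services.
type ObjectBackend interface {
	GetObject(ctx context.Context, bucket, object string) (io.ReadCloser, error)
	PutObject(ctx context.Context, bucket, object string, data io.Reader) error
	ListObjects(ctx context.Context, bucket, prefix string) ([]string, error)
	DeleteObject(ctx context.Context, bucket, object string) error
}
```
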
Tobias Macey
0:25:16
That's funny that it was actually a request from Microsoft that led to this feature. And I like the point that you made, that the S3 API is about more than just the storage, because by allowing people to interface with these other object stores, such as in the Azure case, where that's what all the other services know how to talk to, or in the Google case, where you're able to load data into Google Cloud Storage and then from there directly load it into BigQuery, it simplifies adoption of not just the storage system, but also additional services and capabilities that you wouldn't necessarily think of at first.
Anand Babu Periasamy
0:25:54
And you can see right there, it doesn't really help us at all; we are actually helping Google Cloud and other public cloud services. But I think in the end, if you do what is right, if you are altruistic, it actually pays off. It is profitable to be altruistic.
Tobias Macey
0:26:09
One of the challenges I'm sure that you faced on that front is how to map some of these S3 APIs, particularly the Select capability, onto the other object stores that don't necessarily have them natively. And I'm curious how much effort that was, and how much involvement you've had from those different companies, in terms of either working with you to add those capabilities to your code base, or in terms of modifying the capabilities of their platforms to make it easier for you to implement.
Anand Babu Periasamy
0:26:41
So there are actually some minute details that matter, say, for example, with encryption or S3 Select, features like that. The way we did it in the gateway layer is to ask: if the backend does not support the functionality and I support it, then if you go directly to the backend, can you still read the data? This detail actually matters a lot. What users really want is for us to be a transparent layer on top of the existing system, whether it's an existing cloud storage provider, or a NAS, like an NFS mount point, or even GlusterFS; even SAN vendors do that. What they really want is their existing data. So take an NFS volume: if you already have, say, two petabytes of data sitting on that volume, you're not going to migrate the data to MinIO. If I write the data in a proprietary format on your NFS, you can access the data through MinIO's S3-compatible object API, but if you go to the backend directly, it will all be binary-blob-type data and you won't be able to access it. That is not a good idea, right? So what people really want is that whatever data they put through MinIO is written natively, like a file on NFS, or a GCS object in Google Cloud, or a blob in Microsoft's cloud. We need to translate it and write it natively to the backend. Now, when we do that, there are some features the backend does not implement; say, for example, the encryption APIs that Amazon has, SSE server-side encryption with client-supplied keys, SSE-S3, and KMS. I can't really do a full translation of those to the backend. If I encrypt at the gateway layer, then the data is encrypted by MinIO, and if you go to the backend directly, the Blob API won't know how to decrypt the object we encrypted. So how did we handle cases like these? We actually implemented the functionality. In cases where we could translate it completely without masking anything, we did it transparently, but we never wanted to write it in some proprietary format. Encryption is a case where some of the large financial customers wanted the data to be encrypted before it left their data center, and they were using the public cloud just as a DR copy, so it made sense for them to use the encryption feature. But for the rest of them, generally what I tell people is: use the features that object storage has; nowadays object storage has more features than you need. Generally, don't pick features that are very specific, where if you move somewhere else, from Azure to Google Cloud, you are going to miss that feature; it's okay for you not to use it. It's really important not to adopt features that will hold you hostage. If you use one, you have to have a very strong reason, right? In our own case, as a complete alternative, we were able to catch up with everything that Amazon has that the rest of the world needs. In fact, in some cases, like WORM-like capabilities, we were able to add more without breaking the compatibility. But for the gateway part, it's not possible for us to mimic that subset of the APIs without breaking the idea of writing the data in a native backend format. So we implemented them as options, and the users are pretty knowledgeable.
They know how to turn them on and off depending on their needs, and we have good guidelines on how to make these choices.
Tobias Macey
0:30:53
In terms of the actual implementation of MinIO, I'm wondering if you can talk through the overall system architecture and some of the ways that it has evolved, and then also some of the additional services that you've built alongside it, because I know that you have replicated some of the functionality around the key management store, and IAM, or identity and access management, and just some of the overall tertiary projects that have come to be necessary as you evolve the overall capabilities and reach of MinIO.
Anand Babu Periasamy
0:31:26
So the architecture is where it fundamentally differentiates itself. It is also the reason why we saw that object storage was the right starting point for us. If I were going to implement yet another textbook theory of object storage, it wouldn't be any better, right, other than for cosmetic reasons. What I saw was that when you look at the S3 API, it is simply a GET, PUT, LIST type atomic, immutable web service, right? And if that is the case, your entire storage stack could be reduced to just a web server. And with MinIO, if you have noticed, you just download the static binary and run it. And if you have a distributed set of machines, you download that same binary to all the machines and type the same exact command line, and they all cluster themselves up. There is not even an installation process, right? It's a static binary: download and just run. That is where the real differentiation starts. It is a web server at its heart; all of the storage layers got collapsed into a single layer. Everybody else, if you look at their design, there is a distributed block layer that handles erasure code or replication or whatever, then there is a virtual file system layer, and they have a multi-protocol API gateway on top. And this is where they get it wrong, right? They end up following the same old file system theories that are not applicable today. It's the same argument as when Gluster was implemented in user space: even the kernel hackers thought that user-space file systems were toys and we would never be able to deliver a meaningful file system. But in every benchmark, we showed that we were faster than the kernel file systems; there were no good kernel-based file systems out there. And now I see the same thing, even more simplified. An object storage server is actually not a file system with an API gateway that speaks S3. Instead, an object storage server is a web server, and a distributed object storage system is nothing but a collection of web servers that are stateless, distributed, and cooperating, right? So we collapsed everything into just a web server. Then what about all the storage functions, like erasure coding, distributed locking, bitrot checking, encryption: where do they go? They are simply web handlers attached to the web server. Literally, as soon as the API request comes in, it decodes the S3 API, mostly XML content as part of the HTTP request body, and translates that into a generic object API. And then when the data comes, it's simply atomic blobs of the object: you erasure code it, bitrot protect it, encrypt it, whatever you want to do; they are simply stateless transformation functions. Once an object is erasure coded, you've got a collection of blocks, and you scatter them across the other web servers. It's a really simple design, and that is what made MinIO a very different breed. Because there are no multiple layers and there is no metadata database; it is a collection of web servers with a cooperating distributed lock. And this made it very resilient, very high performance, stateless. You can crash a system in the middle of a busy workload and you won't lose any data. There is no caching, there is no eventual consistency; everything is committed. In fact, we even write to the disks with O_DIRECT turned on, so we are not even relying on the disk buffers, right?
You could lose power in the data center and data could get corrupted because a file system like XFS did not journal the data; we don't want to run into those kinds of problems. It's a really simple architecture, and that is what made MinIO different.
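
As a rough illustration of the "web server plus stateless transformation handlers" shape he describes, here is a toy sketch in Go. Everything here, the handler name, the transform signature, the port, is a hypothetical stand-in; this is not MinIO's code.

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// transform is a stateless function applied to an object's bytes.
// Erasure coding, bitrot checksums, and encryption all fit this shape:
// bytes in, one or more blocks out, no shared mutable state.
type transform func(data []byte) ([][]byte, error)

func putObjectHandler(encode transform) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// 1. In a real S3 server, the URL path carries bucket/object
		//    and the headers carry the S3 metadata.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// 2. Apply the stateless transformation (e.g. erasure code into blocks).
		blocks, err := encode(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// 3. Scatter the blocks to peer servers and commit atomically;
		//    that step is omitted in this sketch.
		_ = blocks
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// Identity transform: one "block" containing the whole payload.
	identity := func(data []byte) ([][]byte, error) { return [][]byte{data}, nil }
	http.HandleFunc("/", putObjectHandler(identity))
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```
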
Tobias Macey
0:35:28
So can you talk through the overall clustering strategy that you use, and some of the ways that you actually manage the object metadata, given that you don't have a centralized storage layer or a database for being able to reference it?
Anand Babu Periasamy
0:35:42
So the centralized metadata layer came from a legacy idea, right? A file system historically meant translating a POSIX-like operation into a collection of blocks, and once you break things into a collection of blocks that are mutable, then you need to keep that association of which blocks make an object and which objects are in which bucket. That's where the metadata comes in, right? We actually did not find any such need. Within the MinIO architecture, the concept is like this: every 16 drives (this is a default setting you can override if you want) is an erasure set, and a collection of machines, a server set, is a cluster. Within the server set, say you have 32 servers, each having 32 drives: you have 1,024 drives. Now, what MinIO does is deterministically pick 16 drives, making an erasure set, and when the object comes in, it actually breaks it into parts and then places them on the 16-drive set. Now, a collection of objects in a bucket actually lands on different drive sets, and whenever an object comes in, the location of the object, in terms of which drives it has to go to, is simply a hash-mod operation on the object name. Object names are unique across the namespace, and the hash-mod operation deterministically places the object on the same exact 16 drives each time. And now, when a request comes to any one of the servers, each server has the same logic. It's simply a deterministic lookup algorithm; it doesn't have to check a metadata database. Also, object data and metadata are written together. By doing that, you actually have a significant advantage in terms of resiliency and performance, because you don't need to hold a lock in a central system. Others have to hold almost a global lock, because you need to make sure that the metadata database is updated. We have no such requirement; each object is granular and completely parallel. There is no global lock here: because object names are unique, each object holds an object-level lock, and then takes the data. Often the data comes in bits and pieces, like a ranged GetObject or a multipart PutObject; all of them are committed atomically, backed by erasure coding and scattered across the drives, and the location lookup is simply deterministic. So within a cluster, this is how it behaves. Now, what happens when I have multiple data centers, multiple clusters? That is when different regions behave exactly like Amazon AWS regions: depending on the bucket location, your request gets forwarded to the right cluster. And within that cluster, every one of the nodes is fully symmetric; there is no metadata node or NameNode-like property here, right? Every node is equally capable. Once the bucket location tells you which cluster your bucket is in, any one of the nodes knows exactly how to serve the data. It may be a little hard to grasp without a picture, but think of it as if you wrote a simple web service for a file uploader and file getter, right? You would find that you would actually gravitate towards this kind of model very closely, and the only difference would be: if you don't have to remember the location in a database, how would you do it? You simply use a deterministic lookup algorithm.
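
A toy version of that deterministic lookup might look like the following. The hash choice (FNV-1a) and the drive counts are illustrative assumptions for the sketch, not necessarily what MinIO uses internally; the point is only that every node computes the same placement with no metadata database.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	drivesPerSet = 16   // the default erasure set size mentioned above
	totalDrives  = 1024 // e.g. 32 servers x 32 drives
)

// erasureSetFor returns which erasure set an object lands on.
// No lookup table, no metadata database: any node, given only the
// object name, independently computes the same answer.
func erasureSetFor(objectName string) int {
	h := fnv.New32a()
	h.Write([]byte(objectName))
	numSets := totalDrives / drivesPerSet // 64 sets of 16 drives
	return int(h.Sum32() % uint32(numSets))
}

func main() {
	for _, obj := range []string{"photos/cat.jpg", "photos/dog.jpg", "logs/2019-06-01.gz"} {
		fmt.Printf("%s -> erasure set %d\n", obj, erasureSetFor(obj))
	}
}
```

Because object names are unique across the namespace, the same name always maps to the same 16 drives, which is what lets any node serve any request without coordination.
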
Tobias Macey
0:39:39
So how do you characterize MinIO in terms of the axes of availability and consistency in the CAP theorem? And how do you handle network failures in that clustered context, to ensure that the blocks that you're trying to write to, based on these modular hash operations, are going to be accessible at the time of the commit?
Anand Babu Periasamy
0:40:03
Yeah, so when I designed the system, I never took a CAP-theorem-based approach. That makes sense when you are actually doing a replication model, when you are doing scale-up or replication; that's where these problems come into the picture and you need to make a hard decision. In the case of MinIO, just to map it onto the CAP theorem, what is our model? I would say that it is CP. Consistency is actually important in a way most people don't realize. In eventually consistent systems, just before the data is propagated, a drive holds back, say, uncommitted data that is not erasure coded, or it's held in the cache. If that drive dies, and the odds of that drive dying are higher than the other drives, the uncommitted data leads to huge operational support problems. It is really important to actually be done with it and let the application know exactly what failed, and we took consistency very seriously. And also partition tolerance: I think it is a no-go; you can't be forgiven if you make mistakes there. Because in any system it's okay to fail, but not to corrupt the data, right? In any storage system, this assurance has to be there. So partition tolerance and consistency are super critical. Now, when it comes to availability, here is where I think the CAP theorem is not well understood by most architects: not all properties are equal all the time. It depends on the type of distributed system you design. In our case, the availability part is where erasure code comes in handy, a combination of erasure code and the S3 API itself, the concept of the S3 API being atomic and immutable in nature; the application remembers the context, right? In an object storage system, every operation, as long as it is committed atomically, and I continue to keep that atomicity, then I'm good to go. Which means that I only need to take care of metadata updates and data updates, and because there is no metadata database, they're written together atomically; I'm able to actually solve these problems more elegantly. Now, how are we dealing with the availability problem? Here is where, when you take an object, break it into multiple parts, and spread it across 16 drives across 16 servers, you have plenty of parity. If I have, say, four parity shards, I can lose up to four nodes and my data is still available. Because of erasure code and the very nature of the S3 API and how MinIO was designed, we got plenty of availability to withstand failures. That's the part where it is completely alright to compromise, to be a CP system.
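
To put numbers on the "four parity shards, lose four drives" claim, here is a small sketch using the klauspost/reedsolomon Go erasure coding library. The 12-data/4-parity split and the payload are illustrative assumptions for the example, not a statement of MinIO's defaults.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 12, 4 // 16 shards total, like one 16-drive erasure set
	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	object := bytes.Repeat([]byte("some object payload "), 1000)

	// Split the object into data shards (parity shards are allocated too).
	shards, err := enc.Split(object)
	if err != nil {
		log.Fatal(err)
	}
	// Compute the 4 parity shards from the 12 data shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing any four drives: up to parityShards shards may vanish.
	shards[0], shards[5], shards[9], shards[13] = nil, nil, nil, nil

	// Rebuild the missing shards from the survivors.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("all shards verified after losing 4 of 16:", ok)
}
```

The storage overhead of this layout is parity/data, here 4/12, or about 33% extra space, in exchange for surviving any four simultaneous drive or node failures in the set.
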
Tobias Macey
0:43:01
For somebody who is deploying MinIO, what are some of the operational characteristics and server capabilities that they should be considering? And in the clustered context, how is load balancing handled? Is it something where you would put a service in front of the cluster, or is that something that the nodes themselves handle, as far as routing the requests to the different nodes within that cluster to ensure a proper distribution of the data?
Anand Babu Periasamy
0:43:28
So the way we designed it, in my view, is that the nodes themselves should handle all the routing. But it's not always the case, right? So when do you really need a load balancer or a service mesh in front? It depends on the use case. I find that when people deploy this for the ML-style data processing workloads where they need high performance, every hop counts, and usually these applications are running alongside MinIO over 100 Gigabit Ethernet and NVMe. When they do that, you want the fastest path to the data, the shortest path to the data; NVMe and SSD are all about how fast the time to first byte is. In that case, they go directly, and any one of the nodes they hit, they all have the same capability; it's fully symmetric. So I wouldn't recommend putting a load balancer there. Then how do they discover which node they want to go to? It is simple: say for a 16-node cluster or a 100-node cluster, you simply have these node names mapped in DNS, a simple round-robin DNS-based approach. The clients are spread uniformly across all the servers, and they do the job. And the S3 API is quite resilient; even if a server restarts or anything, HTTP itself is a stateless interface, right? Now, when do you need a load balancer or a service mesh in front? When you are actually building an application use case, where you are using this like a photo store or some mobile application where the data is being accessed across the internet. What I commonly find is they want to have a layer that does SSL termination and bandwidth throttling; there are multiple other key reasons, like routing and service discovery. I actually see almost two classes emerging here. The slightly old school (I don't know if I can call them old school, but it is the most common approach) is the load balancer approach. The emerging one, particularly among the large-scale deployments, is switching to a service discovery model. Here is where the Istio and Envoy type solutions come in; there are multiple solutions out there in the market, like SmartStack, and we have actually evaluated a bunch of them. For the few building a very large stack, the service discovery part actually takes care of scaling: as you add new clusters, they get registered automatically, and applications discover these capabilities. But then, when it comes to actually hitting the data, they don't want to go through a load balancer. This is where an Envoy-like sidecar proxy brings the load balancer, almost like a personalized proxy, closer to the application. These are some really cool emerging ideas; if people are building large-scale applications, like SaaS applications, they would go this route.
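
A small sketch of the round-robin-DNS discovery just described: resolve one well-known name to the full set of nodes and pick one per request. The hostname and port here are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net"
)

func main() {
	// One DNS name with many A records, one per MinIO server in the cluster.
	addrs, err := net.LookupHost("minio.internal.example.com")
	if err != nil {
		log.Fatal(err)
	}
	// Every node is symmetric, so any of them can serve the request;
	// picking one at random spreads clients uniformly across the cluster
	// without any load balancer hop in the data path.
	node := addrs[rand.Intn(len(addrs))]
	fmt.Printf("sending this request to http://%s:9000\n", node)
}
```
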
Tobias Macey
0:46:47
And in terms of the project itself, I'm wondering how you approach the ideas of project governance and sustainability, and just some of the overall strategy, from a business perspective, of how you approach being open source.
Anand Babu Periasamy
0:47:01
Yeah, actually, this is the one that's close to our heart. You know, while we are doing startups, the reality for us is that when we did Gluster last time, the team came together because they were passionate about open source. In fact, while I say open source, we are the Free Software guys, right? You know the difference: free as in speech, not as in beer. Exactly right. And it was out of passion that it grew. But when you build products with passion, the attention to detail and the craftsmanship get noticed, and that is what led to success. Not because we were a bunch of entrepreneurs who wanted to make money and got together and figured out what would make the most money; it didn't happen like that. We were passionate about open source back then. I call it open source simply because it's a company thing; it kind of hurts me every time I say open source, but when I say free software, being a purist, very few people really understand it. But the truth at the end of the day shows up, right? Gluster was the case, and MinIO is the case: you see there is nothing proprietary here; it is a 100% pure open source, or free software, model. And why does it matter? Back then, customers used to ask me whether open source means it's inferior, whether it's not secure; they used to ask all these questions, and it has come a long way. Now you can see even VMware had to, almost overnight, embrace Kubernetes. In every large deployment, particularly in the data space, we find customers actually mandating open source for their infrastructure, because they have seen even big companies killing products, shutting down every few years the products that enterprises depend upon. With open source, you can easily hire top talent across the world, and it is no longer a problem. Now, when it comes to governance, how do we run our work, compared to other open source projects? We took a benevolent dictator model, right? In fact, it's not a pure benevolent dictator; I would say it's benevolent dictators all the way down. My job is to make sure that I continue to grow more and more leaders; everyone on the team is empowered to make decisions within their scope. The hardest part for me is that it's like you have one giant canvas and a bunch of artists, and the goal is to create one piece of artwork, one mind. It is not easy, but you have to be opinionated, and you have to bring everybody to see the same vision. And this has to extend beyond our team into the community. Right now we are able to see the scale at which we grow: like 324,000 downloads a day (maybe some of them are CI/CD runs), but by any measure, say the 3,300 members on our Slack channel, or 18,000 GitHub stars, any number we look at, we have deep penetration. But the key here is we are not an Apache Foundation or Linux Foundation project. What I found was that the moment you bring a consortium onto the board, you have multiple chefs, and every vendor has their own commercial interests; it becomes really hard to drive a project. I found that with the benevolent dictator model, being opinionated, if you are true to your cause, you will be able to build a powerful community. And the community showed faith in us that they don't need a third-party nonprofit organization to endorse that we are a company that can be trusted.
And they have seen this even in the past, like with Gluster: I'm no longer involved, but it is a project that is still thriving without me, and even without some of the core members, right? That is what gives users a lot of confidence that we understand what we do. And for us, open source is not a business strategy; it is a philosophy we believe in, and they are able to see that. Combined with that, you also have to build the product with craftsmanship, thinking like an artist, and a minimalist culture. If you establish the culture, that culture is what actually brings the right kind of people together, and once you bring those people together, then you can't stop it, right? It becomes a way of life for them, and they get emotionally attached. Some of these folks, wherever they go, they are almost emotionally attached, not only to MinIO; everything they do afterward, they want to do it like this. That is what gives the stability for governance, and the stickiness and sustainability for the long term.
Tobias Macey
0:52:23
Yeah, it's definitely great seeing businesses that have that understanding of the ethos and culture of open source projects and ensuring that they're actually in it for the technology itself. And not just for the business opportunities that come along with it.
Anand Babu Periasamy
0:52:41
You know, we are kind of fortunate now, because the industry has come a long way; you no longer have to convince the customers why open source and free software are great for them. Now they are telling us that is what they need as an input. In fact, the surprising part is that even investors are now favoring open source, particularly in the enterprise and infrastructure space; they favor open source over proprietary startups. Investors are even advising the entrepreneurs behind proprietary software startups to consider open source. The thing is, you have to believe in it, right? It's not something you should think of as just a business strategy.
Tobias Macey
0:53:20
Because of the fact that you have gained this measure of popularity, I'm sure that there have been some interesting use cases that have come about, and I'm wondering what you have found to be some of the most interesting or innovative or unexpected ways that you've seen MinIO used.
Anand Babu Periasamy
0:53:34
Okay, that's actually the fun part. I can point to a true private cloud use case: literally, there is MinIO running inside Royal Caribbean ships, because they have no connectivity. I've seen similar cases in movie production: on site, they have to capture footage, and the raw videos are so huge that they need to do editing and processing on-prem. From cases like that, the one that surprised me the most was the hundreds of thousands of unique IPs of MinIO running inside AWS. I don't know how many of them are seriously growing. I actually kept telling these users that they should be using AWS S3 when they are inside AWS, because they are running us on top of EBS, and EBS, I think, is like three times more expensive than S3; you are not really saving money by running on top of EBS block storage. When I spoke to these users, what they were telling me was that they wanted cloud portability. They have fully automated their stack through Kubernetes, and they are able to burst their deployments into the public cloud, whether for CI/CD or application development; they are able to move between clouds. It is portability, and convenience is more important than the cost. But if you asked me, I would still recommend that if your data is growing, you stick to S3 inside AWS, and if you are outside AWS, then MinIO makes more sense. And then the one that came recently that was quite interesting to me was the Tantan use case. Tantan is like the Tinder of China, and they have an insane number of really small files. These are like sub-100-kilobyte files, and the largest they see is like a 200-kilobyte object. When you have a few kilobytes' worth of objects, you should really not be looking at object storage; you should be storing them in a database. But the problem is that they had petabytes of such data, and no database would scale that big. So what they did, interestingly, was replace our drive filesystem layer with RocksDB, so they can actually store petabytes of really, really small files that should otherwise be stored in a database. In this case, they are stored as objects, and the objects are indexed as a collection of compacted RocksDB databases, with MinIO doing the erasure coding, which brought all those capabilities with strict consistency across many nodes at petabyte scale. This is almost like a distributed database with an S3 API. That actually surprised me; if they had asked me up front, I would have completely discouraged them. This surfaced after they were running us in production at scale, and they published a blog post recently. It came as a surprise, and I'm actually encouraged by the innovation that they did. Then I also see the medical imaging use case; usually they are the most conservative. The VNA, vendor neutral archive, and PACS data, they're storing it on us. Of the largest banks in the US, something like 13 of the 15 are using us, and in Europe it's similar. We found that the organizations that are supposed to be conservative are the ones adopting open source object storage. I think they have come a long way, and also I think they're struggling with the data explosion problem.
Tobias Macey
0:57:28
Yeah, it's definitely interesting, sticking files inside a database inside an object store to be able to query them at scale.
Anand Babu Periasamy
0:57:34
Yeah, you know exactly what I mean.
Tobias Macey
0:57:39
In terms of the future direction of MinIO, I'm wondering what you have planned for the product roadmap, and any additional projects that you may build to integrate with the object storage and provide additional capabilities.
Anand Babu Periasamy
0:57:54
Yeah, so one of the newer things we are seeing, more and more, is that even the databases are coming to object storage. That also was a surprise to me. Initially I thought databases and object storage were parallel worlds that should be running side by side. But because even small data is growing big, everyone from Vertica and Teradata in the proprietary world to open source engines like Presto, Spark, and Drill is turning to object storage. They are leaving the storage backend to object storage and focusing on just the SQL processing and query engine. So we are working with these projects on integration and validation; that's a matter of ecosystem, of enabling more and richer applications to come to object storage.

But in terms of features inside MinIO itself, the most important thing that I care about is actually supportability. Supportability in the sense of: how do you really operate a large infrastructure at scale with very little expertise? As you scale bigger and bigger, if you need to hire more people, then you are not scaling at all, right? So how do you do that? This is where I see that supportability is not a separate problem; it is very much a product problem. You would have seen some recent features, simple things but very powerful, like mc admin trace: a trace utility, but for an object storage system. Running a trace is as simple as an admin trace command; you point it at a running system, and it gives you the details of every single operation going on at that point. Any time an application has some bug, or it's a bug in our code, or it's a compatibility issue where something we broke or some new application is using a legacy API in the wrong way, you run admin trace and immediately we can tell whether it's a bug in their code or in our code, and what went wrong. From that to the console log, where you remotely attach to a running system and it gives you the entire history of the console. Simple things. From that to how you can detect slow drives and bad networks; they are not somebody else's problem. Almost always, when a drive or a batch of new drives is slow, you would end up blaming MinIO, saying that MinIO is slow, or timing out, or unstable. You can't really throw people at this problem; debugging and troubleshooting by hand is not a scalable approach, right? This is where the supportability capabilities we are adding into the system will enable not only the users, but also us, to reduce the burden at scale. At this pace, we make new releases every week, sometimes even multiple times a week, we are able to move fast, and users are able to operate private cloud deployments on par with or better than public cloud deployments. These folks are able to run large infrastructure with no previous storage experience, right? To me, enabling them with this tooling is super important. Then there are other features we are adding, like SUBNET, which is how we can remotely help these users, lifecycle management, the ability to expand clusters on demand, and things like that.
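The trace and console features described above are exposed through MinIO's mc client. As a rough illustration (the alias, endpoint, and credentials below are placeholders, and flags may differ between mc releases):

```
# Register the server under an alias
mc config host add myminio https://minio.example.com ACCESS_KEY SECRET_KEY

# Stream every API operation hitting the running cluster
mc admin trace myminio

# Tail the server's console log remotely
mc admin console myminio
```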
But in general, I would say we are the anti-roadmap company. I always tell our architects and our users that if you send a pull request to remove a feature, I will immediately accept it; as for adding one, ask ten times whether you really, really need this feature, because adding is easy and maintaining is expensive. We try very hard to not make it feature-rich, because that would eventually lead to collapse, right? So we are very particular about making sure that every feature out there is a very important part of the system. So that's about it.
Tobias Macey
1:02:23
Are there any other aspects of the work that you're doing at MinIO, or the overall space that you're working in, that we didn't discuss yet that you'd like to cover before we close out the show?
Anand Babu Periasamy
1:02:32
I think, in general, I see that object storage has come a long way, and I would like to see more tools on top, for example for better data governance. Data governance and data management themselves are kind of old-school thinking; I'm not thinking in those terms. The real problem is actually that your data is at such a large scale and is constantly changing. Think about the ability to control access, to define policies, even to establish identity: in those days, you wanted to give access based on user IDs. Nowadays there are no users; there are applications, and you need application identities. Applications cannot do two-factor authentication, right? They need to be doing certificate-based authentication. From things like that, to how you define application policies and control access to data and APIs, a lot has changed in recent times, and the industry is far behind in its understanding of how to manage data on object storage and how to establish these identities. Then there is even discovering the data, right? When you have data being generated faster than all of your past data combined, how do you even organize it? Cataloging it, indexing it, putting it in organized folders and buckets is just going to be impossible. The way you then have to look at it is to think of this as a search problem, right? If you have too much data, you can't organize it as folders; what you need is a better search engine, better access mechanisms, and policy control over all of it. I think there is a lot of room for these kinds of powerful tools to emerge on top of the data management and data storage system, and I would like to see more new projects or startups going after them.
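As one concrete illustration of the policy-driven access he is describing, the S3-compatible API already allows policies to be attached programmatically rather than per human user. The sketch below uses the minio-go client; the endpoint, credentials, bucket name, and policy document are all invented for the example, and a real deployment would scope policies to specific application identities.

```go
package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Connect with an application's credentials (placeholders here).
	client, err := minio.New("minio.example.com:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// A minimal S3-style policy granting read-only access to one prefix.
	policy := `{
	  "Version": "2012-10-17",
	  "Statement": [{
	    "Effect": "Allow",
	    "Principal": {"AWS": ["*"]},
	    "Action": ["s3:GetObject"],
	    "Resource": ["arn:aws:s3:::reports/public/*"]
	  }]
	}`

	// Attach the policy to the bucket over the S3 API.
	if err := client.SetBucketPolicy(context.Background(), "reports", policy); err != nil {
		log.Fatal(err)
	}
	log.Println("read-only policy applied to bucket 'reports'")
}
```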
Tobias Macey
1:04:46
For anybody who wants to follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And you've touched on this a little bit just now, but I'd still like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today, if you have anything else to say on that matter.
Anand Babu Periasamy
1:05:06
Yeah, to summarize, I think the gap in the tooling is the search part, and access management in terms of policies. In the past, because you had many different storage systems, trying to do unified data governance on top of a variety of SAN and NAS vendors and database vendors was hard. This is where the data lake got all the bad rap, right? In the new world, the good part is that all of the data is getting consolidated onto object storage, and there is only one storage system at the heart of the data infrastructure. Everything else is simply stateless containers and VMs around the object storage, and they are all accessing it through the S3 API. If this is the case, data management is finally something practical, right? But the key here is that the old-school data management model does not work here anymore. It has to be fundamentally rethought in the form of a search and access platform. I haven't seen a good product yet in the market, but certainly I keep hearing about new startups wanting to go after this market.
Tobias Macey
1:06:26
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with MinIO. It's definitely an interesting project, and one that I have been keeping an eye on for a while and look forward to using for my own purposes. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day.
Anand Babu Periasamy
1:06:43
Oh, thank you. By the way, this is a common pattern we see: users running MinIO on their home NAS or just their home drives, and also on corporate, large-scale machines. For me, that is the real proof that it has delivered what I wanted: when it is simple enough for a personal use case, that level of simplicity is what is needed to actually operate a very large infrastructure, right? Often, when we talk to our users, I see that it is running on everything from their laptops and home NAS appliances on up.
Tobias Macey
1:07:22
Thank you again, and have a good rest of your day.
1:07:33
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!