Behind The Scenes Of The Linode Object Storage Service - Episode 125

Summary

There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3 compatible object storage to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the current state of your object storage product?
    • What was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?
  • What is the scale and scope of usage that you had to design for?
  • Can you describe how your platform is implemented?
    • What was your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch?
    • How have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?
  • What have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode?
  • What are the most important capabilities for the underlying hardware that you are running on?
  • What supporting systems and tools are you using to manage the availability and durability of your object storage?
  • How did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage?
  • What are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams?
  • What are your thoughts on the state of the S3 API as a de facto standard for object storage?
  • What is your main focus now that object storage is being rolled out to more data centers?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Will Smith about his work on building object storage for the Linode cloud platform. So Will, can you start by introducing yourself?
William Smith
0:01:09
Yeah. Hi, I've been with Linode for about five years now. I originally worked on our transition from the Xen hypervisor to KVM, which was a super exciting project. Since then I've moved on to working on the new API that we launched two years ago, and after that I was moved onto the object storage project.
Tobias Macey
0:01:26
And do you remember how you first got involved in the area of data management?
William Smith
0:01:30
Well, Linode does a lot of data management, what with hosting virtual private servers and everything. But specifically with object storage, they already had a prototype of an object storage cluster developed based on Ceph, and they just wanted to bring in a team of developers to productize that, which is kind of where I came in.
Tobias Macey
0:01:53
And so you mentioned that you started off with this Ceph prototype. I'm wondering if you can just give a bit of an overview of the current state of what you have available for object storage on the Linode platform, and some of the motivating factors for building it out and managing it yourself rather than building an integration with somebody who already has object storage as an offering, such as Wasabi or Backblaze.
William Smith
0:02:18
Absolutely. Right now our object storage service is available in Newark and Frankfurt, and we've got more locations planned. We offer a fully S3-compatible API, and that means that it can plug into basically any tool or service that already uses object storage, which is fantastic. We also offer static site hosting on the platform. It's got full integration with all of Linode's first-party tools, and there's a promotion going on right now where we're giving it away for free until May, so if you haven't used it, you should check it out. As for why we built this out ourselves instead of partnering with someone: we had internal uses for this, and we wanted to integrate it into our platform so that we could use it however we wanted and saw fit, without any restrictions. We also already had a lot of organizational expertise in hosting Ceph, because it's what powers our block storage product. So it just made sense that, since we already ship hardware to data centers all the time, we would stand up clusters ourselves and integrate it into our platform for customers, in addition to using it ourselves.
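As a rough illustration of what "plugs into basically any tool" means in practice, existing S3 libraries only need to be pointed at a different endpoint. Here is a minimal boto3 sketch; the endpoint URL is a guess at the Newark one and the credentials are placeholders, not working values.

```python
import boto3

# Minimal sketch: standard S3 tooling aimed at an S3-compatible endpoint.
# Endpoint URL and credentials are placeholders, not working values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://us-east-1.linodeobjects.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="example-bucket")
s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello, object storage")
print(s3.get_object(Bucket="example-bucket", Key="hello.txt")["Body"].read())
```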
Tobias Macey
0:03:18
And so given the fact that you were already using Ceph for the block storage, it seems pretty obvious that you would end up using it for the object storage capabilities as well. I'm wondering if you can just give a bit of a discussion of some of the trade-offs of using Ceph for the object storage as opposed to something like MinIO or one of the other available projects, or just building it out yourself from scratch.
William Smith
0:03:41
Well, building it from scratch was never on the table, because we wanted a much faster time to market and that wouldn't allow it. We didn't really consider alternatives like MinIO because we already had a team that knew how Ceph worked and how to administer it pretty well. So once they set up the object storage gateways on a new Ceph cluster and saw that it all worked pretty much how they expected, it was a clear choice that we should use it.
Tobias Macey
0:04:02
And in terms of the scale that you're building for and the scope of usage that you had to design for, I'm wondering how that impacted the overall design and testing and implementation of the rollout of the object storage capabilities.
William Smith
0:04:20
Absolutely. So we wanted object storage to be a cloud primitive available with Linode, so that when you sign up for an account with our platform, it's just another tool in your tool belt. And that means that it needed to support a broad swath of use cases, everything from just hosting user-uploaded files, to long-term data storage, or big data with a lot of movement and IO, to static site hosting that might let new people join the platform without necessarily needing to spin up and maintain a server. But that presents an interesting problem, which is that as a consumer of object storage, it appears to you as if the buckets you create have unlimited space for you to upload into, when in reality there is actually a physical server that only has so much space to store things. So a big part of the initial effort of the project was trying to figure out how we could scale the clusters to keep up with demand. And we played with a whole lot of things. We played with scaling Ceph the way that they recommend you do so, and there's a big paper that CERN published alongside Ceph that claims to have scaled it up to thousands of nodes and hundreds of petabytes of data. We played with crazier things, like scaling multiple Ceph clusters together to appear as one cluster to a customer, which was fun and exciting. But ultimately, the problem is pretty much the same as just maintaining availability for new servers in a data center. When a customer requests a virtual server, we need to make sure that we have one to give them, all the time. So this is just a different version of that problem that we already work with all the time.
Tobias Macey
0:05:52
And in terms of maintaining compatibility with the S3 API, I know that Ceph out of the box has that capability, but what are some of the issues or challenges that you see as far as being able to keep up with some of the recent additions to the S3 API, for things like the S3 Select API or anything like that?
William Smith
0:06:11
I think for now we don't intend to support anything that isn't supported by the Ceph system. Their project is very active, and they do add new layers of compatibility all the time, so any work that we spent trying to implement that on top of their system would probably eventually be redundant anyway.
Tobias Macey
0:06:27
And so as far as maintaining the installation, have you had to do any customization of Ceph itself to be able to support the scale that you're building out, and has that posed any challenges of being able to stay up to date with the latest releases?
William Smith
0:06:42
We are running our own custom build of Ceph. Most of the patches applied to it were just plucked from upstream to fix bugs we encountered, and we encountered some pretty interesting and surprising bugs during our rollout. Like when we initially launched the cluster internally to test, bucket name validation wasn't working, so you could create buckets that weren't valid domain names, which broke the DNS that we wanted to have for all buckets. Or we found an ACL violation issue, which was pretty severe and let you modify other people's resources if you knew just what to do, which had been fixed on the master branch but not backported to a release yet.
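To make the bucket-name bug concrete: because each bucket is exposed as a DNS host name, names have to follow DNS-style rules. A rough Python sketch of the kind of check involved is below; the exact rules and function name are illustrative, not Linode's actual validation code.

```python
import re

# Rough DNS-style bucket name rules (illustrative, not Linode's actual code):
# 3-63 chars, lowercase letters/digits/hyphens, each dot-separated label must
# start and end with a letter or digit, and the name can't look like an IP.
LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def is_dns_compatible_bucket_name(name: str) -> bool:
    if not 3 <= len(name) <= 63:
        return False
    if IPV4.match(name):  # an IP address is not a valid bucket name
        return False
    return all(LABEL.match(label) for label in name.split("."))

assert is_dns_compatible_bucket_name("my-site.example.com")
assert not is_dns_compatible_bucket_name("My_Bucket")  # uppercase/underscore rejected
```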
Tobias Macey
0:07:21
And as far as the initial assumptions that you had when you were first getting involved with this project about the operability and maintainability of Ceph as the underlying layer for object storage, as you have progressed through the various levels of testing and then the general release, how have those assumptions been challenged or updated?
William Smith
0:07:44
Yeah, I mean, we started with a pretty good understanding of how Ceph works at scale, because the block storage system was rolled out many years ago. So we had a pretty good idea of how to build out a cluster, keep an eye on it, and monitor it for health and problems. The biggest new component was the RADOS gateways, which we had not worked with before. If you're not familiar with the architecture of Ceph, at the heart of it is a system called RADOS, which is the Reliable Autonomic Distributed Object Store. It basically has three front ends that they provide: CephFS, which just lets you mount a chunk of RADOS storage as a file system; the block storage front end, which we use to power our block storage product; and then the RADOS gateway, which is the S3-compatible API that lets you access the underlying RADOS storage in that manner. So we had already had RADOS clusters and maintained them for a long time, but the RADOS gateway was a totally new piece of infrastructure for us. And it presented its own challenges, both in figuring out the right number of gateways to have to get the kind of response times that we wanted without getting clogged up, and in how to route traffic efficiently to them. In order to have static sites and such, you need a separate gateway that serves HTML instead of S3, which is an interesting design decision and presents a unique challenge: as traffic comes in, we need to decide which of the two sets of gateways we want to send it to. We solved this problem and many others by having a smart proxy sit in front of Ceph, facing the internet. And in order for it to work the way we wanted, it needed to be scriptable enough to examine incoming traffic and decide where it ought to be routed to, but it also needed to be fast enough that we didn't incur a huge amount of overhead on every request coming into object storage. So while I initially prototyped something in Python, it was way too slow; it added hundreds of milliseconds per request to everything that you did, and that's just unacceptable. So after doing a lot of research and playing with several things, we landed on a piece of software called OpenResty. If you're not familiar, OpenResty is basically just nginx, but with a bunch of plugins compiled in. Modules in nginx are statically linked, so you have to add them at compile time, and although you could build nginx with these modules yourself, or with any selection you wanted, this distribution is very nice because it comes with everything packaged together in a state that is tested, works, and is used throughout the industry. The most powerful piece of that is the nginx Lua module, which allows you to execute any Lua code that you write within an nginx request context. Doing that, we were able to have requests come in, decide whether this is looking to talk to the HTML gateways or to the S3 gateways, and route it with very, very little overhead per request. It also solved other problems we had, like enforcing rate limits per bucket, in addition to per remote IP address, to prevent someone from trying to take a single bucket offline through too much traffic. Or monitoring usage: since many of the clients are going to be talking directly to S3 instead of going through any other system of ours, we needed a very low latency way to capture that traffic and make sure that we knew what was going on.
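The production version of this routing logic runs as Lua inside OpenResty, but the decision itself is simple enough to sketch in a few lines of Python. Everything here is illustrative: the internal gateway addresses are made up, and the endpoint suffixes are assumptions about how the S3 and static-site host names are distinguished.

```python
# Rough sketch of the gateway-routing decision (illustrative only; the real
# version is Lua running inside OpenResty, and these host names are made up).
S3_GATEWAYS = ["rgw-s3-1.internal:7480", "rgw-s3-2.internal:7480"]
WEB_GATEWAYS = ["rgw-web-1.internal:7480"]

S3_API_SUFFIX = ".us-east-1.linodeobjects.com"            # assumed S3 API host suffix
WEBSITE_SUFFIX = ".website-us-east-1.linodeobjects.com"   # assumed static-site host suffix

def pick_upstream(host: str, request_count: int) -> str:
    """Route a request to the static-site gateways or the S3 gateways based on
    the Host header, spreading load round-robin across the chosen pool."""
    host = host.lower().rstrip(".")
    if host.endswith(WEBSITE_SUFFIX):
        pool = WEB_GATEWAYS   # serve index/error pages as HTML
    else:
        pool = S3_GATEWAYS    # plain S3 API traffic
    return pool[request_count % len(pool)]
```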
Tobias Macey
0:11:14
And the quota limits are another interesting challenge that you're facing that isn't something that somebody running their own object storage system on Ceph or MinIO would really have to deal with, unless they're in some sort of enterprise context. And so I'm wondering what you're using for being able to handle that metering and quota limits for people who are building projects that are leveraging your object storage.
William Smith
0:11:38
Well, we do use Ceph's quota system to keep track of and add caps to usage. But our approach has generally been that if you're a regular user trying to do normal, non-abusive things, you should have the availability ready for you, and unless you are using it at an exceptional scale, you shouldn't know that there's any cap. But if you are using it at an exceptional scale for a valid use case, all of those knobs are tunable by the support team, so opening a ticket when you hit the quota is probably enough to get it lifted to a point where you can do whatever you want.
Tobias Macey
0:12:17
And you mentioned that you had to build this smart proxy for being able to handle the appropriate routing for the native object storage versus HTML requests. And you're using the quota limits for handling the storage capacities. But I'm wondering what other supporting systems or additional tools you've had to build around the object storage solution to be able to manage the availability and durability of the project as well as any ongoing configuration maintenance and deployability of it?
William Smith
0:12:47
Absolutely. So we had to build an internal, very back-end service which basically administers Ceph clusters for us and handles creating credentials on demand for new customers, tracking usage through Ceph, and, you know, all of the reporting that we need to that end. A lot of the monitoring and making sure that the service was durable we could reuse existing systems for, because we had already been maintaining Ceph for a long time, but the administration of specifically the S3 component was entirely new. And our design goal for that was to have a centralized system that could handle keeping all of the clusters in sync with what we think things should be, while allowing the clusters themselves to be the authority on what data is stored within them, so that we don't have to keep an authoritative real-time record of who owns what, where, because that would be a huge, enormous, and maybe impossible problem to solve. Ceph itself is eventually consistent, and so our tracking of who has what data in what cluster will eventually be right. But if we need to know right now what's there, we want Ceph to be the one to tell you, because it's really the only piece of the puzzle that can know it.
Tobias Macey
0:14:11
And as you were determining the deployment of the object storage, given that you already had the Ceph deployments for block storage, were there any different considerations that needed to be made for the hardware that it was getting deployed to? Or is everything just homogeneous across the different block and object storage supporting infrastructure?
William Smith
0:14:32
Well, I didn't have a very big role in designing the hardware, but I do believe it is a different build for object storage. From what I understand, object storage, while it also wants very high storage density, is more sensitive to running out of memory than block storage is. So we needed to make sure that these servers had enough memory to handle the load of storing these arbitrarily sized, often smaller chunks of data, compared to the large volumes allocated in the block storage system.
Tobias Macey
0:15:03
And I'm sure that the request workloads are also a lot more bursty than they would be for block storage, where, as you said, it'll be a lot of small requests for potentially small files or fragments of files, versus most likely more long-running jobs on block storage that are going to be piping larger volumes of data through it.
William Smith
0:15:22
Well, really, the biggest difference there is that the object storage system talks to the public internet just through our proxy, whereas the block storage system gives you a volume that you can mount to a single server. So when you're using block storage, you've already got a server running that is the only thing talking to this volume, and whether or not you're doing heavy IO, we're not getting requests for that same piece of data from like a thousand different places at once. With object storage, since it's on the internet, and you could be updating data while someone else is retrieving it, or you could be hosting images that are on a big popular website and getting requests from thousands of IPs at once, you do have a substantial difference in traffic. That was one of the reasons that we were looking so hard into rate limiting, to make sure that no individual thing took up so many resources that it overwhelmed the server. While rate limiting per client IP is important, by also rate limiting by the bucket that you were making requests to, we could make sure that if one piece of media somewhere exploded in popularity, those requests would still be seen and throttled or blocked, to keep the cluster online. The goal with rate limiting, of course, is just to protect the overall system and not to limit the usefulness of it, so the limits are quite high, and I don't think that they've ever really been hit in practice.
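A toy sketch of that two-dimensional rate limiting is below, just to show the shape of the idea: a request has to pass both a per-client-IP limit and a per-bucket limit. The real system does this in Lua inside OpenResty, and the limit numbers here are made up.

```python
import time
from collections import defaultdict

# Toy sketch of two-dimensional rate limiting (per client IP and per bucket).
# The real system runs in Lua inside OpenResty; the limits below are made up.
LIMITS = {"ip": 200, "bucket": 1000}  # allowed requests per second (illustrative)

class TokenBucket:
    def __init__(self, rate: float):
        self.rate, self.tokens, self.last = rate, rate, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the full rate.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

ip_limiters = defaultdict(lambda: TokenBucket(LIMITS["ip"]))
bucket_limiters = defaultdict(lambda: TokenBucket(LIMITS["bucket"]))

def should_serve(client_ip: str, s3_bucket: str) -> bool:
    # A request must pass both limits: one hot object can't take a bucket
    # offline, and one chatty client can't starve everyone else.
    return ip_limiters[client_ip].allow() and bucket_limiters[s3_bucket].allow()
```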
Tobias Macey
0:16:49
And so as far as the actual rollout of the product to general availability, what was the process that you went through to ensure that you could have the necessary confidence to know that it would be reliable enough for people to be able to use it at scale, and to be able to feel comfortable sleeping at night, knowing that it's out and being used by the general public?
William Smith
0:17:13
Oh, yes, I wanted to make sure I could sleep at night after we turned this thing on. We used a very agile process for the development of this product, so when I got brought onto the team and we earnestly started turning this from a proof-of-concept cluster into an actual product available for people, it took, I think, about a month before we had an internal alpha available within the company. And that was available to all of the employees. We told them, do whatever you want, go nuts, we might break it, because we're still learning things about how this works as a product, but it was an incredibly powerful tool for giving us confidence and revealing the problems that customers would end up facing. We had one person in the company stand up a Prometheus cluster which used the alpha as a back end for its data storage, which saw our IO go through the roof pretty much instantly, because Prometheus is constantly pulling data in, and then it's taking it back out and mutating it and putting it back in. And it was a kind of workload that we just couldn't have produced on our own without a concentrated effort. But this employee already had Prometheus stood up for something, and he wanted to see if object storage was a good back end for it, and also give us a little bit of traffic, which was super eye-opening. It helped us to tune Ceph for the workloads that we would actually expect to see, and it helped us to decide how to write the administration part of the system, which would need to be sensitive to these kinds of workloads and permit them, because it was a real, legitimate, valid use, it was just a lot of it. After the alpha had lasted for a while, and we had made some big changes to the system during that time from the things that we learned, we went ahead and did a closed beta. And that was active for maybe three or four months, during which time a select subset of customers that had expressed interest in the product were given access to it and asked to put their real workloads onto it, as much as they wanted to put them on a beta. And that too was very eye-opening, less in tuning the system, because at that point we had it pretty much where we wanted it to be, but more in terms of customer expectations and what features they would want that we might not have foreseen. And again, a lot of changes went through the system during the beta as we learned the way that customers were going to use it, what kind of things they expected to see, and what deficiencies we had in seeing what people were doing in the system. Because it's a lot different when you've just got a bunch of customers that are doing whatever they want, as opposed to a bunch of people who you work with that you can send a message to and say, hey, what are you doing, you know?
Tobias Macey
0:19:58
And for workloads that are more data intensive, where you want to ensure that there's a high degree of durability of the underlying information, I know that you have support for versioning of the objects within the buckets. But I'm curious if you have any support for being able to do replication of data across different regions or different zones for being able to ensure that you have multiple copies, or is that something that would be pushed to the client application to handle that replication of information?
William Smith
0:20:30
Well, presently the data is replicated within the cluster several times; there would have to be a pretty catastrophic failure for us to lose data that's stored in Ceph. But we do not yet have support for replicating that data to other clusters. That's something that we spent a decent chunk of time looking into, but we decided to put it off until we knew just how much would be consumed by regular usage in one cluster, before we committed extra space to replicating that data around the world without knowing just how much space we were going to need.
Tobias Macey
0:21:04
And also, it depends largely on the use case whether having it replicated globally would be useful at the object storage level versus just making it available through a CDN. Because if it's somebody who's running a website, they could use something such as Fastly or CloudFront or Cloudflare for being able to replicate that information to their users. Whereas if it's somebody who's doing a lot of data processing, where they might have disparate clusters in different offices across the world and they want to be able to get access to that underlying data for analytics, that's where it's more useful for them to be able to actually have that data replicated to the different regions where they might want to access it.
William Smith
0:21:43
Right. I mean, in some ways, if you replicate the data across the world, you are creating a CDN, or at least most of one.
Tobias Macey
0:21:53
Fair point. And so in terms of the benefits and new use cases that you've been able to enable internally at Linode, what are some of the more interesting or exciting capabilities that have come about from that?
William Smith
0:22:08
Oh, absolutely. Having object storage available has been super exciting within the company, and as soon as we opened up that internal alpha, people just started using it. And it was amazing to see the things that they came up with, because personally I had things that I wanted it for, but I wouldn't have thought of the things that everyone else used it for. We've seen applications be made stateless in relatively simple ways, by just using object storage to store information that would have otherwise had to reside on a server, which has massively improved the rollout process for those applications, because the servers themselves are no longer important. They're no longer special. You can just kill them and make a new one and it's fine. We've seen it used as a storage back end for systems that just plugged into object storage, which again made them much easier to deploy and maintain, because they don't need a special disk or some other system to store the data; they've got this object storage that we already had and that was already going to be maintained. We've seen it used as an intermediary step in pipelines, as a convenient place that one pipeline could deposit data that something downstream would need. It's been a real game changer to have around, and I'm very glad that we did it, for our own internal purposes alone.
Tobias Macey
0:23:27
And I'm sure that having it available as you're deploying the managed Kubernetes service has been beneficial as well, because of the fact that a lot of cloud-native workloads leverage object storage as a backing store rather than relying on persistent disk on the underlying infrastructure that they're running on.
William Smith
0:23:45
Absolutely. The LKE service has built-in support for our block storage product, so if you need persistent disk storage, it's available to your Kubernetes clusters on our platform. But object storage can be much more useful when you don't want to have to mount a disk or worry about any of those details; you just have data that you need to put somewhere and know that it's going to be there when you want to retrieve it later.
Tobias Macey
0:24:10
And as far as the S3 API itself, I'm curious what your thoughts are on its state as the de facto standard for object storage. Have you found it to be at all limiting? Or do you think that it's beneficial in general that there is this one standard that everyone has coalesced around?
William Smith
0:24:29
Well, I'll start off by saying that the S3 API is obviously very good. It is ubiquitous, it has excellent tooling, it plugs right into an amazing number of off-the-shelf things, and that makes it very easy to work with. But when developing this product specifically, I had to work with it at a much lower level, and when you actually start making calls to S3 directly, instead of using some tool or library to do it, the design limitations of the API become very apparent. For instance, it's often not easy to compile all the information you want about an object or a bucket from just one API call. As an example, to fetch the ACLs of a bucket you need to make a separate call, and to fetch the permissions for other users you have to make a separate call, so it can be easy to build up a whole bunch of calls to S3 just to compile one piece of useful information. Additionally, it's got some features that, while very powerful, can also be very hidden, like the bucket versioning that you mentioned earlier, which has some very strange bits of behavior in that the prior versions won't be obvious when you are listing objects in most ways, because they're not returned from the regular S3 endpoint and you have to make special calls to find them. But you can't delete a bucket that isn't empty, so if you want to remove a bucket that appears to be empty, you might not be able to, because it's got prior versions, and disabling versioning isn't actually enough to get rid of them in all cases; you often need to configure a lifecycle policy and then wait until they clear themselves out. We also had one customer during the beta period who was very confused about the usage in one of their buckets, because we were reporting that they were using a huge amount of data, and from their perspective it looked like they were using very little. After working with the customer and looking at Ceph and trying to debug it, we found that it was actually all hidden in multipart upload metadata. So S3 gives you, I think, five gigabytes per file per upload, and if the file you need to upload is bigger than that, you have to upload it in multiple chunks. Most clients handle this for you seamlessly and it works great, but the client library they were using, which was actually a MinIO client in Java, was not aborting multipart uploads that failed in the correct way, so the metadata was staying in there. But because you have to make several very strange API calls to see what multipart upload metadata is in a bucket, it wasn't obvious to them where this extra usage was coming from. Which I think speaks to the fact that, while the S3 API is fantastic and very widely supported, it seems to have grown organically as new features were added to S3. And that's good in some ways: you have a very well supported and maintained thing that a lot of people know how to use. But at the same time, it makes it so that if you're just coming into the game and looking at it, you might say, I have no idea what's going on here. Do I think that it should be standardized and redone? Probably not. I mean, at this point it's so widely supported that it would be terribly disruptive to the entire object storage ecosystem if there was just a new standard way to access it that everything had to support, and it probably wouldn't work. But it is important, especially if you're working directly with S3, to really read the fine print. There are some gotchas in there.
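For readers who haven't hit these corners, here is a rough boto3 sketch of the call juggling Will is describing: several separate requests to assemble a full picture of a bucket, plus the extra calls needed to surface and clean up lingering multipart upload metadata. The endpoint, credentials, and bucket name are placeholders.

```python
import boto3

# Placeholder endpoint, credentials, and bucket; any S3-compatible service
# behaves the same way for these calls.
s3 = boto3.client(
    "s3",
    endpoint_url="https://us-east-1.linodeobjects.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
bucket = "example-bucket"

# Assembling "everything about this bucket" takes several separate calls.
objects = s3.list_objects_v2(Bucket=bucket)        # current objects only
acl = s3.get_bucket_acl(Bucket=bucket)             # separate call for ACLs
versions = s3.list_object_versions(Bucket=bucket)  # prior versions hide here

# The "invisible usage" case: parts of failed multipart uploads linger until
# you explicitly list and abort them.
uploads = s3.list_multipart_uploads(Bucket=bucket).get("Uploads", [])
for up in uploads:
    s3.abort_multipart_upload(Bucket=bucket, Key=up["Key"], UploadId=up["UploadId"])
```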
Tobias Macey
0:28:01
Yeah, that's definitely true, and anybody who's used it for long enough, I'm sure, will have come across at least one or two of them. I don't doubt it. And I also believe that Ceph itself, while it does have S3 compatibility for its object storage, also has a different protocol that it implemented, at least initially I think, called Swift. And I'm wondering if you have any plans to expose that alternative interface for people who want to be able to leverage object storage using some of the clients built for that.
William Smith
0:28:30
They do have built-in Swift support. We played with it for a little bit; it was not as widely supported in tools, and for that reason, mostly, we didn't consider it as important to implement. And so far I don't think we've gotten any feedback from any customer asking for us to turn it on, so I don't think we plan to at this time.
Tobias Macey
0:28:49
And now that the object storage product has hit general availability and you're deploying it to more of your data centers, what are some of the future plans that you have, either in terms of new capabilities or new internal tooling or support?
William Smith
0:29:05
Absolutely. For starters, we want to roll it out to more data centers and make it more widely available. I can't speak to where we're going next, but we're definitely going more places. We also want to add some exciting new features, like letting you configure SSL per bucket. This is something that that OpenResty proxy is going to be great for, because it's in the perfect position to terminate SSL, and it's plenty dynamic enough to figure out what certificate you've uploaded for a bucket. And of course, we always have work to do on the back end to make it more robust and to work more with the things that we find emerging as we deploy the product in more places.
Tobias Macey
0:29:43
And you mentioned that you've been involved in a number of different projects since you started at Linode, and this is just the latest in a string of them. So I'm curious what your focus is now that the object storage has hit general availability. Is this something that you're going to stick with for a while, or do you have something new on the horizon that you're planning to get engaged with?
William Smith
0:30:02
Well, I'm certainly still keeping an eye on it and supporting the team as best I can. But I had been moved off to another project. And I don't think that I could say what it is yet.
Tobias Macey
0:30:12
Fair enough.
William Smith
0:30:13
I don't think we've announced it anywhere. So
Tobias Macey
0:30:15
all right. And so as far as your experience of building out this object storage platform and releasing it publicly, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process?
William Smith
0:30:29
Well, definitely the most unexpected lesson was to keep an eye on Ceph's bug tracker. They're very good at reporting and fixing issues, but they don't always backport them to releases, and their release cycle is often slower than how fast we want bugs to be fixed in our clusters. Their bug tracker is massive, if you've ever looked at it; it's a very big and very long-running project. But since the project is open source and they track everything so well, it's pretty easy, once you've got a handle on what's going on with it, to take the patches they put up, compile them into our versions of Ceph, and make sure that our customers aren't affected by the bugs that are found upstream. I certainly wasn't expecting that when we went to ship this, because largely that wasn't our experience with the block storage product.
Tobias Macey
0:31:17
And are there any other aspects of object storage, or your work on the Linode product, or anything about your experience of getting it deployed that we didn't discuss that you'd like to cover before we close out the show?
William Smith
0:31:29
Yeah, I would just like to put a point on how useful of a product it is and how many possible applications we've found just internally. As another example of a cool thing we've done with this: our front-end application is a single-page React app, and every time they put up a PR to it (it's open source, it's on GitHub), our pipeline builds that code and uploads it to object storage so that it's accessible immediately. And not only can the team that's working on it see it, but the other teams that are related to that project, that they're making front-end changes for, can immediately see and use that code. It's a very powerful thing for improving how fast we're able to ship things, how much confidence we can have in front-end changes, and how much testing we can do with them. So that's just one more example of something that we found to do with it that is very exciting, and that I hadn't even considered when we started the project.
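A hypothetical CI step along those lines might look like the sketch below: publish a built single-page app into a per-PR bucket so reviewers can load it right away. The endpoint, bucket naming scheme, paths, and credentials handling are all assumptions for illustration, not Linode's actual pipeline.

```python
import mimetypes
from pathlib import Path

import boto3

# Hypothetical CI step: publish a built single-page app into an object storage
# bucket named after the PR. Endpoint and naming are illustrative; credentials
# are assumed to come from the environment; the bucket is assumed to exist.
s3 = boto3.client("s3", endpoint_url="https://us-east-1.linodeobjects.com")

def publish_build(build_dir: str, pr_number: int) -> None:
    bucket = f"frontend-pr-{pr_number}"  # hypothetical per-PR bucket
    for path in Path(build_dir).rglob("*"):
        if path.is_file():
            key = path.relative_to(build_dir).as_posix()
            content_type = mimetypes.guess_type(key)[0] or "application/octet-stream"
            s3.upload_file(
                str(path), bucket, key,
                ExtraArgs={"ACL": "public-read", "ContentType": content_type},
            )

# publish_build("dist/", pr_number=1234)
```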
Tobias Macey
0:32:31
All right, well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
William Smith
0:32:47
The most frequent question I've heard from people, both internal and external, is: how do I tell what's really going on in here, right? The nature of object storage is that you want to upload data to it, and a lot of it, and constantly. But unless you know upfront how long you're going to need to retain that data for, and all of your data can be classified as "this is how long I need it," the lifecycle policies that S3 makes available aren't really useful to you. Because they'll say, okay, well, this is going to expire after a week, but if you have data that is going to need to be around for a couple of months, and some other data that only needs to be around for a couple of days, it can be very hard to manage. And what's worse, it can be very easy to not see that any of that data is sitting there, until you end up looking at the entire bucket and say, well, this is huge. And then you're left with the unenviable task of having a huge pile of data where it can be very hard to figure out what of it is important and necessary, and what of it is just an artifact of something that should be gotten rid of. I think the biggest gap in tooling is something that would solve that problem, that would help look at the data that you have in an object storage bucket and say, oh, well, you've actually accessed this recently, and this is accessed all the time, and this has been sitting here for a year and a half and you don't need it, to help people actually manage their data without having to just have it accumulate forever and build technical debt.
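For a concrete picture of the lifecycle mechanism Will is describing, and its bluntness, here is a rough boto3 sketch. The bucket name, prefixes, and day counts are made up; the point is that rules key off prefix and age, not off how the data is actually being used.

```python
import boto3

# Sketch of blunt, age-based lifecycle rules; bucket name, prefixes, and day
# counts are illustrative. Credentials are assumed to come from the environment.
s3 = boto3.client("s3", endpoint_url="https://us-east-1.linodeobjects.com")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {   # everything under tmp/ disappears after a week...
                "ID": "expire-tmp",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
            {   # ...and stale multipart upload parts are aborted automatically,
                # which would have caught the hidden-usage case discussed earlier.
                "ID": "abort-stale-multipart",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 3},
            },
        ]
    },
)
```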
Tobias Macey
0:34:21
Yeah, it's definitely a big problem, and one that people are trying to attack with things like data catalogs and data discovery services, but it's still an imperfect art, and there's certainly room for improvement, probably with tools that can be more specifically targeted to object storage, as you mentioned, as well. So definitely something worth exploring further.
William Smith
0:34:41
So there's certainly not a silver bullet for it yet.
Tobias Macey
0:34:43
Absolutely. Well, thank you very much for taking the time today to join me and share your experience of building out the object storage platform at Linode. It's definitely a useful service, and one that I'm using myself, so thank you for all of your time and effort on that front, and I hope you enjoy the rest of your day.
William Smith
0:35:00
Thanks for having me.
Tobias Macey
0:35:06
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!