Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

Summary

With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Interview

  • Introduction
  • How did you first get involved in the area of data management?
  • What are the main serialization formats used for data storage and analysis?
  • What are the tradeoffs that are offered by the different formats?
  • How have the different storage and analysis tools influenced the types of storage formats that are available?
  • You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?
  • Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?
    • What are the switching costs involved in moving from one format to another after you have started using it in a production system?
  • What are some of the new or upcoming formats that you are each excited about?
  • How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?

Contact Information

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks, who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. This is your host Tobias Macey, and today I'm interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems. So Doug, could you start by introducing yourself?
Doug Cutting
0:01:24
Yeah, I'm Doug Cutting. I've been building software for 30 or more years now, often with a serialization component. The last 15 or so years have been dominated by work on open source, most notably Hadoop. But for the purposes of this podcast, I think we're going to talk about a project I started called Apache Avro.
Tobias Macey
0:01:51
How did you first get involved in the area of data management?
Doug Cutting
0:01:54
I worked on search engines for a long time, back at Xerox early in my career, at Apple in the early 90s, and on web search at Excite in the late 90s. So we were always building systems that were analyzing large datasets, big collections of text, and through that I ended up working on a project called Nutch, which led to Hadoop. And then it turned out that even though we built Hadoop really to support the building of search engines, people ended up using it a lot to manage all kinds of other data. So I sort of fell into the data management business through search engines.
Tobias Macey
0:02:38
And Julien, how about yourself?
Julien Le Dem
0:02:40
Yeah, so I'm honored to be interviewed along with Doug Cutting. So, back when I was at Yahoo, I was working on the content platform and got to use Hadoop early on. I was there when it first started and remember seeing Doug presenting it at the time. After that I worked at Twitter, where I started the Parquet project in collaboration with the Impala team, working on a columnar format to improve our storage needs. From there I got involved in a bunch of Apache projects, like Apache Pig and Apache Arrow, which is a representation for in-memory data, and from then on I kept being involved with more Apache projects. And so, about how I got into the data storage space: when I was at Twitter, we had Hadoop on one hand, which was very flexible and very scalable, so we could have a lot of machines doing a lot of things like machine learning or analytics, or any kind of code you want. It's very flexible, and you can do a lot of things, but it was still very much file system oriented, so a lot of the data was in flat files and not that efficient. And next to the Hadoop cluster we had Vertica, which is a columnar database, which lays out the data in a columnar representation to be much more efficient at retrieving the data from disk and doing analytics. So Vertica had much lower latency to answer queries, but at the same time it was not as flexible, and it was limited to SQL, right, you couldn't run anything else; it was kind of a black box. And so what we tried to do is to make Hadoop more of a database and less of a file system. So, starting from the ground up, having a columnar representation, kind of state of the art, things from the C-Store paper, which is the academic work that started the Vertica company, and using the Dremel paper that describes this way of storing nested data structures in a columnar representation, and getting into how do we make Hadoop more of a database and be more efficient at retrieving and processing data. At that time I did a lot of reading between the lines of the Dremel paper to kind of understand, you know, the missing parts that were not described there on how to use these representations in a more generic way.
Tobias Macey
0:05:33
The reason that I invited the both of you onto this episode is that in a lot of the conversations I've had with people in the context of data engineering and data management, the question comes up of which serialization format they should use for storing and processing their data. Because as the big data and data analytics spaces have continued to grow and expand in importance and capabilities, there are so many different ways to store your data, and that introduces a lot of confusion. And I'm wondering if we can just start off by briefly summarizing some of the main serialization formats that are available and in active use for data storage and analysis, and some of the tradeoffs that they provide.
Doug Cutting
0:06:18
I can dive in if you like.
0:06:22
I think classically, in database systems, the data was captive: the format that it was stored in was controlled by the people who created the database, and it wasn't a published standard format. These days we've got this open source ecosystem of data processing and data storage projects, where interchange between systems is common and is useful, and so it presents a new problem, this having serialization formats. The interchange that was done before was relatively uncommon; it wasn't a primary storage format, and it wasn't very optimized. So we had things like CSV and XML, and we still see those a lot, because that is what a lot of applications can easily generate, and they are well known standard formats. But, you know, the problem with XML is that it's verbose and slow to process, and with CSV, and XML to some degree, it's that they don't do a very good job of really letting you store data structures with named fields that you can process quickly. There are other things: to get into some of the technical details, it's useful to be able to chop files into chunks that you can process in parallel (splitting files, we tend to call it), and you'd like a format that is splittable. You'd also like a format that's compressible, and those two can be at odds: coming up with a format that is both splittable and compressible is hard. You can't just take a CSV file, compress it, and then chop it up; that doesn't work. You've got to have a series of compressed chunks inside the file that you can find. So the formats have developed over time, as we've learned this is what we need to be able to interchange data between these different components, between systems like Hadoop and Spark and Impala and Hive and all these different things. It's really handy to be able to try different tools on a single dataset, and to be able to generate data from one tool and ingest it into another, and do so efficiently. So there's been a real demand for that. You know, Hadoop started with some formats which weren't very good for interchange, and so did Hive, so the formats we're talking about are really a second generation designed to address this. And Avro is a format that was designed, again, to address all these challenges: to be splittable, to be compressible, to have some metadata so you get standalone datasets that you can pick up and see what are the fields in here, what is the data structure, but still efficiently process it, and to work across components written in different programming languages. And there weren't a lot of things out there like that, there wasn't really anything I could find that met all those requirements, which was what led to Avro. Avro stores things a record at a time: you have a complete record, and then another complete record, and then another complete record. And that's not the most efficient way to process things always. In a lot of cases, what you'd like to do is see all the values of a particular field in a record at once. So you've got a million records in a file, and they all have a date, and you'd like to just process all the dates, for example, and not see all the other fields in all those records. And so then you want a columnar format, and that's really what Parquet is about, is responding to that need.
So it's yet a generation beyond Avro, optimizing a really common access pattern, but also sharing all the other elements of being efficient, supporting compression, being language independent and system independent, so that it can work as an interchange format, but optimized for particular kinds of analysis and access patterns that you see in data systems, which Avro is not optimized for. Is that fair, Julien?
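To make that concrete, here is a minimal sketch of writing the kind of row-oriented, self-describing Avro container file Doug describes, using the Avro Java API; the schema, field names, and output path are invented for the illustration rather than taken from the conversation.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // The schema is stored in the file header, so the dataset is standalone:
        // any reader can discover the fields and their types.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"date\",\"type\":\"string\"},"
            + "{\"name\":\"value\",\"type\":\"long\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("date", "2017-11-22");
        record.put("value", 42L);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // Records are appended one after another in blocks separated by sync
            // markers, which is what makes the file splittable; each block is
            // also compressed independently.
            writer.setCodec(CodecFactory.snappyCodec());
            writer.create(schema, new File("events.avro"));
            writer.append(record);
        }
    }
}
```

Reading the file back needs nothing beyond the file itself, since the writer's schema travels in the header.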
Julien Le Dem
0:10:32
Yeah, I think that's fair. And, you know, likewise, Parquet is more efficient in certain access patterns that are very common when you do a SQL query, but Avro is going to be more efficient in a lot of other access patterns, when you're doing a lot of pipelines that transform data, that read all the data and write all the data. Avro is going to be more efficient in those cases, or in streaming use cases, where you want to reduce latency, when you read a single record at a time and you want it as soon as possible while you're processing streaming data. And Parquet is much more efficient when you're doing a SQL query, for example, because you write the data all at once, and so you can spend more time compressing it better or laying it out differently than row oriented. But on the other end, you're going to access the data from very different points of view, like you're selecting only a subset of the columns and filtering on a subset of the columns. So you may have hundreds of columns, but you're really querying on five or six of them, and so it's very beneficial to have this columnar layout to access that data very quickly, and it compresses a lot better, and there are a lot of things you can do to speed up the analytics side of data processing. But you mentioned, Tobias, that people have to choose, and I think what we may hope for over time is to have better abstractions. It's still a little bit a remnant of the starting point of Hadoop as this distributed file system, and the ecosystem has slowly evolved, adding layers on top of that. It's becoming more and more of a database, and having those abstraction layers on top kind of makes it more seamless whether the data is row oriented or columnar, and what format it's in. Right, because depending on the use case, you may want different layouts, and it kind of makes it difficult for people to take advantage of this if everything is hard coded against a file format. So we're evolving slowly to better abstractions, and again, it becomes more of a database, but more deconstructed. Because, like something Doug said, related to what I was talking about with Vertica: Vertica was this black box that you had to import your data into to do queries. It was much faster for doing analysis than Hadoop, but once the data was inside of it, there was nothing else you could do with the data other than querying it through the SQL query engine. So with Parquet and Avro, you keep all the flexibility of the Hadoop ecosystem, right, you can use many different query engines, you can use many different machine learning libraries, or a lot of different programming frameworks, and each of them works with all of those. So you keep your options open; there's no importing your data into a silo anymore. You have your data in one place, and there are a lot of different things you can do to make things work together with different
Unknown
0:13:54
file formats or storage formats.
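As a hedged illustration of the column pruning Julien described (querying five or six columns out of hundreds), here is a sketch of reading only one field back out of a Parquet file through the parquet-avro reader; the projection schema, field names, and path are assumptions for the example, not something from the conversation.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionSketch {
    public static void main(String[] args) throws Exception {
        // Ask the reader for just the "date" column; the other columns in the
        // file are never read off disk, which is the point of a columnar layout.
        Schema projection = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"date\",\"type\":\"string\"}]}");

        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path("events.parquet"))
                     .withConf(conf)
                     .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record.get("date"));
            }
        }
    }
}
```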
Tobias Macey
0:13:56
Yeah, and it sounds like, particularly in some of the conversations that I've had, a lot of the confusion was sort of bundled up into the idea that people need to pick one format and then figure out a way to use it across every aspect of their system. Where it seems like what would be more beneficial is, for instance, using a row oriented format such as Avro in a streaming context, where you're going to be processing one record at a time, or for, you know, data archival, where you might need to just find all of the information about a particular record at some future date. But then if you're going to be doing live analytical queries, where all of the data is going to be housed in something like Hadoop or Hive, then you would in most cases be better served by having it in a columnar format such as Parquet or some of the other formats available for that, and then maybe just using the ETL pipelines as a means of transforming the row oriented data into column oriented data, so that you're gaining the benefits of each format in the context to which it is best suited.
Julien Le Dem
0:15:06
Yeah, so there are always going to be exceptions. You know, in many analytics use cases a columnar representation is better, but there are always corner cases where it becomes more expensive, especially if you're going to access most of the columns every time. So it depends, and it helps a lot to have better metadata abstractions. One of them is the Hive metastore, which can be used as an abstraction, but more and more are showing up; different companies have built different abstractions that let you hide from the users what format the data is actually in. And I think that's very important
0:15:49
to have these kinds of capabilities.
Tobias Macey
0:15:52
Yeah, and a lot of times, too, people are judging some of the more involved and elaborate formats against run-of-the-mill systems that are using things like JSON or, as you mentioned, CSV and XML. And so really, any format that's more suited to an analytics workload is in general probably going to gain them a number of benefits versus what they had been using.
Doug Cutting
0:16:18
Yeah, for sure. You know, these sort of plain text formats like CSV and JSON are going to be a lot slower. They're nice in that you can look at them a little more easily without a tool, but you're going to have some real performance impact and stored size impact. But also, in selecting a format, you don't want to think about hyper-optimizing for a particular application. I think people are realizing data is an asset; you want to land data in a good, strong format that will last a long time, that you can use in as many different applications as you can, and not have to reproduce it in a lot of ways. Now, there are times when you might want to transform it into a different format as an optimization, to have a dataset which is derived from another one, but then have a pipeline so you can do that repeatedly and think of one as being generated from the other. But keep the original format. You know, one of the reasons that I went down the route of building Avro was that I was worried about a proliferation of data formats. If you want to have an ecosystem, and each component has its own data format, then the number of translations between the different formats in all the systems gets to be exponential, and what you really want is to have some common formats that are usable by a lot of systems. So there might be an optimal format for a given application, but if you kept every dataset in that format, you might end up fragmenting your data and getting less value from it in the end. So there's a tradeoff there between optimizing a single system versus having maximal reuse of your data, and enabling easy experimentation and longevity of your data. You want to sort of curate a data collection which is going to last a long time. So I think there's a little more to it than just the performance.
Tobias Macey
0:18:28
Even beyond things like Avro and Parquet, there are a number of other formats that are available, and sometimes it can be difficult to determine whether some of them are superseded by newer formats, or if each of them is being particularly tuned for a given use case. Some of the ones that I'm thinking of are Thrift and ORC, and then there are newer formats such as Arrow that is being promoted as a way to provide easy interoperability between languages and systems in an in-memory context, for being able to bridge those divides. So I'm just wondering if either of you have any particular insight into the broader landscape of how some of the formats have evolved, if there are any that somebody who is just starting now should avoid because they've been superseded, or if each format is still relevant for the particular case that it was designed for.
Doug Cutting
0:19:26
Thrift and Protocol Buffers are interesting. They're very good serialization systems, but they don't include a file format standard. So if you're talking about data that you're going to pass around as files, there isn't a standard one for Protocol Buffers or Thrift. Various people have, you know, stored data in them, but it's a little more challenging to make a standalone file in these, because they have a compiler that takes the IDL and generates the readers and writers for various programming languages, and you could embed the IDL, or embed a reference to it, but it's a little awkward for building a standalone file which you could pass between institutions, say. So I wouldn't recommend looking to Thrift or Protocol Buffers for a file format. For an RPC system, that's really their sweet spot; that's where they've been used a lot and tend to be used very successfully. So that's for your data on the wire rather than data on disk in a file. ORC, which you also mentioned, is a competitor, I think it's safe to use the word competitor, to Parquet. It started shortly after Parquet was started and has minor pros and cons. I think it's unfortunate we have another format that is so similar in its capabilities to Parquet, and maybe Julien wants to speak more to the pros and cons of ORC versus Parquet.
Julien Le Dem
0:20:58
So, yeah, I can give a little bit of the history. I think Thrift, Protocol Buffers, and Avro preceded Parquet, and Parquet tried to be complementary to them. One of the things they define is the IDL and how you define your type system, and Avro is definitely better at all the parts around pipeline code, where you need to understand the schema and do transformations; it makes it easier to deal with schema evolution, to understand the schema, to be more self describing, and to pass the schema along with the data. And so Parquet is trying not to redefine the IDL, but just to define a columnar format that can be complementary to those things. So you have a seamless replacement: you can use the same IDL that you're using with Avro, for example, to describe your type system, and use this columnar representation on disk when it's convenient, when it's the right use case. So maybe you were using Avro before, and you can still use Avro as your model, but you can stay with the Avro file format, which is row oriented, when that's useful, and you can swap to the Parquet columnar representation when it's better for SQL analysis. And so, in the history of Parquet versus ORC, back in the day there was this need for a columnar representation on disk for Hadoop. My use case when I was at Twitter was trying to make Hadoop more like Vertica, and there was this need, and there was a little bit of overlap in the people working on those columnar formats. You start talking about it when it's ready, right, so you publicize it and you say, hey look, it's open source, we're trying to build this, we think there's a need for it. So, back then, I connected with the Impala team, which was trying to do something as well, and later on we connected with other teams and kind of grew the Parquet community, but it's a little unfortunate that there were these parallel efforts. So, you know, their representation of nested data structures is different: Parquet uses the Dremel model, and ORC is using a different model, but they're going to have very similar characteristics, because they're trying to solve the same problem. I think Parquet has been better at integrating in the ecosystem. From the beginning, I was really aware that I didn't want to build another proprietary file format, you know, the same problem that if you import your data into a database, then you can use it only in your database. I really wanted it to become like a standard for the ecosystem. So from the beginning, from the community building point of view, I spent a lot of work making sure people's opinions were integrated into the design. The Apache Drill team had some needs for new types, and we integrated their needs. The Impala team was coming with a C++ native code execution engine, so the Parquet format is very language agnostic, and we merged our designs early on to create Parquet. And so it's been very open, making sure people could come and get what they need. A team at Netflix did the work of integrating with Presto, and they had some special needs because they were using Amazon and S3 at the time, so we did the work to make sure it would work well for their use case as well.
And just being open, at some point you reach a critical mass, and more people start using it because, you know, they see existing teams and projects using it, and it made more sense for people to reuse the same format instead of inventing their own. So I think that was part of the success of Parquet, being very open and very inclusive in the community early on. And, you know, Spark SQL started using Parquet, and we didn't even have to help them, right, they just decided to do it and did it, and once they were done, they talked about it. So the effort you put in early on to be inclusive paid off pretty well, and now Parquet is pretty much supported everywhere. I think, technically, the characteristics of Parquet are going to be very similar to ORC, but what makes it more valuable, I think, and again, being the Parquet guy, I'm biased, is something that was important to me early on: to make sure that we were making something standard, that we keep the flexibility of Hadoop, which is the beauty of the ecosystem, there are all those tools you can use, and you're not siloed in one tool because of the storage you pick. And so the last part is talking about Arrow, which is kind of the next step. So we talked about serialization formats, and Avro and Parquet as a storage layer on top of Hadoop and HDFS, and Arrow is thinking about the same problem but in main memory, because the access patterns and the characteristics, you know, the latency of accessing main memory compared to accessing disk, are different. So when you are storing data in memory, there are similarly benefits to using a columnar representation in memory, but the tradeoffs are different, right, the latency of accessing memory versus disk is different. You want to optimize more for the throughput of the CPU, where in Parquet you want to optimize more for the speed of getting it off of disk. So there are different tradeoffs that warrant a different format, and that's where Arrow is, more for in-memory processing. And, you know, as technology has evolved, we used to have little main memory and more disk, and now there's more and more main memory, and there are more tiers showing up. We used to have spinning disks; now you have SSDs with flash memory, and you also have NVMe, which is non-volatile memory, which is flash but in the DIMM slots. And so you have different characteristics: the latency of accessing the data and the throughput of reading the data are different, so you have different tradeoffs, and also the cost of storage, so how much main memory versus how much NVMe versus how much SSD versus how much spinning disk you have. And so those different tradeoffs will apply, right, you have more of a range of where you store the data, and how fast you can access it and process it.
0:28:15
And so all those things are very interesting. So that's where you have things on a spectrum: Arrow is more on the memory end, and Parquet is more on the disk end of optimizing the layout for query processing. And in the future there's going to be interesting evolution in where each one is more efficient, and that's where abstracting away where the data is stored, and making this more managed, like in a database, is going to be interesting in simplifying that problem for end users. Ideally, Arrow is something that end users don't need to see or be aware of. I mean, they can be aware of it, but they don't need to be when writing their code; reading Arrow and writing Arrow is more something the database layer manages.
Doug Cutting
0:29:10
That's a good distinction, that Arrow will tend to be used within tools, and, you know, maybe people will indicate they want to use Arrow as the format to pass things between two systems, but it's not a persistent format in the way that Avro and Parquet are. They're all three very complementary use cases: Avro, Parquet, and Arrow.
Julien Le Dem
0:29:35
Yeah, so those are the three categories, right? When you were listing all those serialization formats: you have row oriented, which is more of a streaming category, column oriented on disk for persistence, and columnar for in-memory processing. And for Arrow, you know, we started the community the way we had already built a community while building Parquet, right, getting people together. So hopefully we can manage to have a single representation, which becomes that much more valuable because it's interoperable. If we can agree on having the same representation in memory, then things are more efficient, because you don't need to convert from one format to the other, and also simpler, because you don't need to write all those conversions from one format to the other. So there's a lot of benefit to agreeing early on on what the format is going to be, and building on top of that, which is what we're doing with Arrow.
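To make the in-memory columnar idea concrete, here is a small, hedged sketch using the Arrow Java vector API; the vector name and values are invented, and the allocator and constructor details may differ slightly between Arrow releases.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowVectorSketch {
    public static void main(String[] args) {
        // All values of one field live contiguously in memory, so a computation
        // that only touches this column never pays for the other fields.
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector dates = new IntVector("date", allocator)) {
            dates.allocateNew(3);
            dates.set(0, 20171120);
            dates.set(1, 20171121);
            dates.set(2, 20171122);
            dates.setValueCount(3);

            long sum = 0;
            for (int i = 0; i < dates.getValueCount(); i++) {
                sum += dates.get(i);
            }
            System.out.println("sum = " + sum);
        }
    }
}
```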
Tobias Macey
0:30:35
One of the questions that I had in here as well is the subject of how important it is for a data engineer to determine which format they're going to use for storing their data, and what the switching costs are if they come to the realization that the format they chose at the outset doesn't match their access patterns. But from our conversation, particularly about Avro being useful as a format in multiple different contexts, it seems like what's more important is just making sure that all of the data that you store is in the same format, so that your tooling can be unified no matter what you're trying to do with it, and that if you do need different access patterns, then at that point you would do the transformation for that particular use case. I'm just wondering if I'm representing that accurately.
Doug Cutting
0:31:34
That sounds right to me. If you know that your access patterns tend to be SQL, and you tend to get batches of data at a time, that means that Parquet could be your primary format. And if you've got more streaming cases, and you're not doing SQL as much, then Avro might be the primary format. Converting between those two can be done pretty much losslessly; there are probably a few edge cases that don't translate automatically, so you're not stuck forever. So knowing a bit about your applications, and then picking for a given dataset which is the best of those two, I think is probably a good path for most folks.
Julien Le Dem
0:32:16
Yeah, so the Java libraries of Parquet have been designed with these, you know, drop-in replacement capabilities. Let's say you use Avro for designing the model of all your data, and let's say you use MapReduce jobs for doing ETL: you can just replace the output format to be Parquet versus Avro, and it's very flexible. So from a programming standpoint, you still read and write Avro objects, but under the hood you can swap to the Avro row oriented format or the Parquet columnar format, and it's pretty much seamless.
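A hedged sketch of that swap, using the parquet-avro bindings to write Avro-modeled records into a Parquet file; the schema, record contents, and output path are placeholders, and builder method names may differ slightly across library versions.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetSwapSketch {
    public static void main(String[] args) throws Exception {
        // The same Avro-defined model as before; only the output format changes.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"date\",\"type\":\"string\"},"
            + "{\"name\":\"value\",\"type\":\"long\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("date", "2017-11-22");
        record.put("value", 42L);

        // The writer converts row-shaped Avro records into Parquet's columnar
        // layout under the hood, so the application code barely changes.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            writer.write(record);
        }
    }
}
```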
Tobias Macey
0:32:59
One of the things that I'm curious about is, particularly given the level of maturity for both of these formats and some of the others that are available, what the current evolutionary aspects of the formats are, what's involved in continuing to maintain them, whether there are any features that you are adding or considering adding, and also the challenges that have been associated with building and maintaining those formats.
Doug Cutting
0:33:29
My experience with file formats is that they're things you don't want to change very quickly, because compatibility is so important. People don't want to have to rewrite their datasets; they want to be able to take data that they created five years ago and process it today using the latest versions of the software. If you version the format a lot, then it can be really tricky to guarantee that you can read it. You also want to, in many cases, guarantee that things generated by a new application can be read by an old application, so you need both forward and backward compatibility. And in the way most organizations work, they don't update all their systems in parallel, so you really have very few opportunities to change the format itself. What we tend to focus on is improving the tools, the usage, the APIs, the integration with programming languages, higher level ways of defining types, things like that, rather than, at least that's the case for Avro, rather than extending the basic format. Because we can't do that without breaking people, and people need to be able to rely on the format having both that forward and backward compatibility.
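To illustrate the kind of forward and backward compatibility Doug is describing, here is a minimal sketch of Avro schema resolution, where a reader with a newer schema consumes data written with an older one; the schemas, field names, and default value are invented for the example.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionSketch {
    public static void main(String[] args) throws Exception {
        // Old writer schema: only a "date" field.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"date\",\"type\":\"string\"}]}");
        // Newer reader schema: adds "source" with a default, so old data still reads.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"date\",\"type\":\"string\"},"
            + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // Write a record with the old schema.
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("date", "2012-01-01");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
        encoder.flush();

        // Read it back with the new schema; the missing field takes its default.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved =
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
        System.out.println(evolved); // {"date": "2012-01-01", "source": "unknown"}
    }
}
```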
Julien Le Dem
0:34:43
Yeah, so like Doug said, Parquet is evolving slowly, for those same reasons. First, we need to maintain backwards compatibility forever: when something has been written, we need to make sure you're always going to be able to read it. And forward compatibility means also that when you add features to a file, you want the old readers to be able to read the data; they're not going to take advantage of the new features, but they're still going to be able to read data that has been written with the new library in a way that still works. So, for example, some of the new features that are being added to Parquet, and there are some discussions about them: some are very simple things, like there have been better compression algorithms in the past few years, whether it's Brotli from Google or Zstandard from Facebook, that can provide a better compression ratio and better speed of compression. Those are relatively simple, and you just need to make sure it's clear for people that if they start using the new compression, then only the new versions of the libraries will be able to read it. And then there are other things that are more advanced, like bloom filters, for example, and there are different things that need to be taken into account when we add features. You know, first, Parquet is a language agnostic format, so you can't just make a Java implementation, for example, and say it's done; we need to make sure that there's going to be a Java and a C++ implementation, and we need to make sure we have a spec, so that we document the binary format in the spec. It's not just, look, there's a bloom filter feature and here's the API to access it; we actually define every bit of the file format in the spec as well, so that it can be implemented in both languages, in Java and native, and be consistent, and we do cross compatibility testing and things like that. Other things that are challenging are more like semantic behavior. For example, in Parquet we added timestamps as a type. From the beginning you had ints and floats and variable length types like strings, and when you add timestamps, there are actually a lot of ways you can interpret a timestamp, and the SQL spec has different things like timestamp with time zone or without time zone. And it's a little bit challenging to make sure that the semantics are understood the same way across the entire ecosystem. That's where you need to make sure there's good communication between communities, and there's a lot of work; it's not just code, it's also collaboration between communities, because you want to make sure that when you write a timestamp in Spark SQL, then there's no time zone problem when you read it through Hive, and that you interpret the data the same way between Spark SQL, Hive, Impala, Drill, and all those query engines and systems that use the Parquet format. So it's a little bit challenging sometimes, and sometimes it's slow moving, but, you know, it's people's data, right, it's not transient. Once it's stored, you want to make sure it's stored correctly, and this is a persistent system, right, so you want to make sure you're going to be able to read it in several years, and your data is not going to become obsolete as the library evolves.
Tobias Macey
0:38:57
How do you think that the evolution of hardware and the patterns and tools for processing data are going to influence the types of storage formats that either maintain or grow in popularity?
Julien Le Dem
0:39:12
So I think I touched a little bit on that earlier. You know, with the evolving hardware, there are a lot of things that are changing at the moment, whether it's SSDs, which have very different characteristics from spinning disks, or, you know, NVMe, which is basically something that's cheaper than memory with slightly more latency of access. The data landscape is shifting: you have more and more tiers of storage for the data, with different characteristics of how much it costs to store the data there and how fast it is to retrieve it and process it. So this is going to influence how people store their data, and it kind of explains why you have Parquet and Arrow, and how you want to be able to convert from one to the other really fast, and use different tradeoffs on how much you compress your data, because you're comparing the speed of IO versus the speed of the CPU. So it's going to be very interesting. The other technology aspect that's coming in is the GPU. People are using GPUs more for doing data processing, and actually Arrow has been used there: there's a GPU open analytics group that is defining a columnar in-memory representation for GPU processing, and they're using Arrow now as a standard for interoperability and exchanging data between different GPU based processing systems. GPUs are also getting more and more memory, right; one of the problems of the GPU is the high cost of transferring data from main memory to GPU memory, compared to the speed of the GPU itself. The GPU can process data really quickly, but it's costly to move the data from the main memory to the GPU. But you can see a pattern where the GPUs are getting more and more memory, because they're being used more and more for data analytics and machine learning, and not just for, you know, video games. So it's going to be interesting to see how those evolve, and these different tradeoffs of memory storage versus spinning disk storage are going to shape a little bit how we do data storage, and how we improve the layout and the compression. You know, having more or less compression, whether you want more speed or more compact storage, is going to be very interesting.
Doug Cutting
0:42:08
Yeah, I'll just second what Julien has said, for the most part. There are all these time space tradeoffs that you're making, where, you know, if you have something that's completely uncompressed, it can be very fast to process in memory, but it might take up a lot of memory, and if you compress it a bit, you could store more of it in memory, and so you'd be able to get more work done before you had to hit some slower form of storage. Those sorts of tradeoffs are very tricky, and they're very sensitive to the relative performance of these different tiers of storage. We're starting to see some very fast persistent tiers, which change things; you can start to think of things that are accessed within a few cycles as storage systems, because the memory persists. So we'll see what ends up being the most effective formats. Arrow is an interesting thing to track. And, you know, sort of fighting all of that is this need to have standard interchange formats: you don't want to adopt a format for a fringe architecture; you really don't want to keep a lot of your data in a format unless you've got an ecosystem of applications which can share it in that format and take advantage of it, a format for which it's efficient. So, you know, Avro, Parquet, and Arrow are each designed for sweet spots of today's ecosystem, and I suspect they will survive for quite some time, for many years yet, but it's not unlikely that some other formats will join them as this sort of storage hierarchy evolves.
Tobias Macey
0:44:03
And are there any other topics that you think we should cover? Before we start to close out the show?
Doug Cutting
0:44:09
I have one amusing anecdote.
0:44:12
Before, or probably around the same time, that Julien was starting Parquet, I created myself a columnar format called Trevni. I tried to reproduce what was in the Dremel paper, and, as Julien mentioned, there were missing bits in the paper that I could never recreate, and so I came up with yet another way of representing hierarchical structure within a columnar file format. And then Julien came along and bested me with it, because he was actually able to understand the Dremel paper and implement it fully, and also really develop a strong community around that. Trevni hadn't caught on in any quarters yet, and so the wisest thing to do was to let people forget about it, because, you know, we don't need multiple formats that are very similar, filling the same niche. I'm pleased that Parquet came along and replaced Trevni, to the degree that Trevni ever had a spot. Anyway, it was mostly that I couldn't figure out those missing bits that Julien did figure out in the Dremel paper; it was pretty quick and breezy in parts.
Julien Le Dem
0:45:32
Yes, it's a little hand wavy in that paper, and I had to hit my head several times to kind of figure it out and find out what was going on. I felt really bad for a while about, you know, Trevni, and Parquet kind of replacing it, but, you know, I'm glad we're on good terms.
Doug Cutting
0:45:55
You know, it was good that Trevni hadn't caught on. If people had built systems around it and had large amounts of data in it, then to some degree we would have had to commit to preserving compatibility with it, but it never really got that critical mass before Parquet showed up and started to become significantly more popular. No ill will. T-R-E-V-N-I, it was "invert" spelled backwards, for no good reason.
Tobias Macey
0:46:28
Is there anything else that you think we should talk about before we close out the show?
Julien Le Dem
0:46:33
No, that's it, I think.
Tobias Macey
0:46:34
Well, for anybody who wants to follow the work that both of you are up to and the state of the art with your respective serialization formats, I'll have you add your preferred contact information to the show notes. And then, just for one last question to give people things to think about, if you can each share the one thing that is at the top of your mind in the data industry that you're most interested in or excited about. Doug, how about you go first?
Doug Cutting
0:47:02
Sure. I mean, I'm just fundamentally excited by this notion of an open source based ecosystem of data software. I think we're really seeing an explosion of capabilities for people to get value from data in a way that we didn't in prior decades, and I think we're going to continue to see the power that people have at their fingertips explode, and the possibilities. You know, this year we're talking a lot about machine learning and deep learning; I don't know what it'll be next year, but there will be something, and it'll be able to really take off, and it'll be something that is useful. It's not just hype, because this ecosystem is driven by users; it's the nature of this loosely coupled set of open source projects. So I'm continually amazed by that and continue to be excited. I think that's going to deliver more good things to people.
Tobias Macey
0:48:11
And Julien?
Julien Le Dem
0:48:13
Yeah, I agree with this. You know, you can see this deconstructed data stack, where you used to have the database, which was very siloed and, you know, a fully integrated stack, but in this ecosystem each component is kind of becoming a standard independently. So you have Parquet as a columnar file format, but you also have other components of this deconstructed database, like Calcite, which is the database optimizer that has been used in many projects; it's kind of the optimizer layer of a database, and it's a reusable component. Parquet is storage. As we mentioned earlier, Arrow is an in-memory processing component, and those things being reused adds a lot of flexibility to the system, right, because you store your data, and then you can have many different components that start interacting with each other, and you have the choice of different types of SQL analysis, or different types of machine learning, or different types of, you know, ETL and more streaming, and all those things can interact together in an efficient way. And that's where things like Parquet and Arrow are contributing: helping with interconnecting all those things in an efficient way. Because initially you had the lowest common denominator, like CSV or JSON or XML, which were the starting points because they were easy and supported everywhere, but not very efficient. And now we're getting to that second generation, where we're looking at what are the common patterns that all those systems need, and what's an efficient way of handling them and communicating. And that's where the columnar representation, things like Arrow and Parquet, is there for analysis, and for more streaming things or ETL, Avro is the better representation. So you have those standards that evolve and that enable having this deconstructed database, all those elements that are very flexible and loosely coupled and can interact with each other. So I think the next component that is starting to evolve is having a better metadata layer: knowing what are all our schemas, how do they evolve, what are the storage characteristics, how do we take advantage of our storage layer or the interconnection between systems. And it's going to become more of that very powerful, very flexible, deconstructed database.
Tobias Macey
0:51:07
Well, I really appreciate the both of you taking time out of your day to join me and go deep on serialization formats. It's definitely been very educational and informative for me, and I'm sure for my listeners as well. So thank you again for your time, and I hope you each enjoy the rest of your evening. Thank you.
Doug Cutting
0:51:26
Thanks, Tobias. It's fun to find somebody who actually cares about these things.
0:51:32
Kind of the boring backwater of big data.