In this episode PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC - especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving - and how F3 rethinks layout and encodings to be efficient, interoperable, and extensible. Xinyu explains F3’s two major ideas: a decoupled, flexible layout that separates IO units, dictionary scope, and encoding choices; and self-decoding files that embed WebAssembly kernels so new encodings can be adopted without waiting on every engine to upgrade. He discusses how table formats and file formats should increasingly be decoupled, potential synergies between F3 and table layers (including centralizing and verifying WASM kernels), and future directions such as extending WASM beyond encodings to indexing or filtering.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Xinyu Zeng about the future-proof file format
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the F3 project is and the story behind it?
- We have several widely adopted file formats (Parquet, ORC, Avro, etc.). Why do we keep creating new ones?
- Parquet is the format with perhaps the broadest adoption. What are the challenges that such wide use poses when trying to modify or extend the specification?
- The recent focus on vector data is perhaps the most visible change in storage requirements. What are some of the other custom types of data that might need to be supported in the file storage layer?
- Can you describe the key design principles of the F3 format?
- What are the engineering challenges that you faced while developing your implementation of the F3 proof-of-concept?
- The key challenge of introducing a new format is that of adoption. What are the provisions in F3 that might simplify the adoption of the format in the broader ecosystem? (e.g. integration with compute frameworks)
- What are some examples of features in data lake use cases that could be enabled by F3?
- What are some of the other ideas/hypotheses that you developed and discarded in the process of your research?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on F3?
- What do you have planned for the future of F3?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- F3 Paper
- Formats Evaluation Paper
- F3 Github
- SAL Paper
- RisingWave
- Tencent Cloud
- Parquet
- Arrow
- Andy Pavlo
- Wes McKinney
- CMU Public Seminar
- VLDB
- ORC
- Protocol Buffers
- Lance
- PAX == Partition Attributes Across
- WASM == WebAssembly
- DataFusion
- DuckDB
- DuckLake
- Velox
- Vortex File Format
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:17] Tobias Macey:
Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. You're a developer who wants to innovate. Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers. MongoDB is ACID compliant, enterprise ready with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at mongodb.com/build today. Your host is Tobias Macey, and today I'm interviewing Xinyu Zeng about the future-proof file format. So, Xinyu, can you start by introducing yourself?
[00:01:41] Xinyu Zeng:
Hi. I'm Xinyu Zeng. I'm currently a fifth year PhD student at Tsinghua University, where my advisor is Huanchen Zhang. In a prior life, I got my bachelor's degree from the University of Wisconsin-Madison. I also have internship experience at RisingWave and Tencent Cloud. Currently, my research mainly focuses on database systems, with a particular interest in columnar formats, multimodal data lakes, and data infrastructure for AI.
[00:02:04] Tobias Macey:
And do you remember how you first got started working in this overall space of data and data management?
[00:02:12] Xinyu Zeng:
So my story with data management started in my junior year, when I was an undergrad at UW-Madison. I took the undergrad database course and the undergrad operating systems course together, and I felt that building systems was much more fun for me than pure theory or algorithms work, because you can build the whole system from the ground up, really know all the details of how things are built, and then make a performance improvement and immediately see whether it works in the whole system or not. So after finishing the course, I joined a research lab at UW-Madison to start doing some database research.
I think I stayed at the lab for about a year, mainly working on distributed transaction processing. After that, I naturally applied for a PhD in databases, and then I came to Tsinghua to continue working on databases. That brings me to where I am today. Yeah.
[00:03:21] Tobias Macey:
And so one of the recent projects that you completed was a research paper that I came across called the future-proof file format, or F3. I'm curious if you can summarize some of what that is focused on and the story behind how you got involved in that research.
[00:03:40] Xinyu Zeng:
Okay. So the full name of F3, as you just mentioned, is the future-proof file format. And as the name indicates, we aim to build a next generation columnar file format that is future proof. In order to achieve this goal, when we designed the format at the beginning, we tried to make it efficient, interoperable, and extensible at the same time. Those are the basic principles, or high level goals, we had when designing the format. I will go through the details of those designs later in our conversation. And regarding the story behind the F3 format, it's a very long story because it basically covers my entire PhD life, but I will try to keep it short.
So the entire idea of improving Parquet and then creating a new format actually came from a very simple observation: reading Parquet into Arrow consumes too many CPU cycles. And Parquet itself, as a relatively old project, carries a lot of historical baggage that people have to deal with when integrating Parquet with Arrow. There is actually a public conversation between my two collaborators and advisors, Andy Pavlo and Wes McKinney, discussing this. I think you can still find the recording on YouTube; it's one of the CMU public seminars.
I think it was in 2021. So after we noticed the problem with Parquet, we thought, hey, we should first do an evaluation or benchmarking paper to quantify the real problems and to see where we can improve. That idea became our first VLDB paper, which is called An Empirical Evaluation of Columnar Storage Formats. In that paper, we systematically studied the internals of Parquet and ORC, and we indeed quantified a lot of problems with those formats. After that paper, because we had so many lessons learned from the evaluation and from the deep dive into the formats' internals, we thought it was the right time to build a new file format, given that the old formats have many problems.
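To see the observation that motivated this for yourself, here is a minimal sketch, assuming an installed pyarrow and a local Parquet file at a hypothetical path, that separates the Thrift footer parse from the full decode into Arrow:

```python
# Minimal sketch: observe the CPU cost of decoding Parquet into Arrow.
# Assumes pyarrow is installed; "data.parquet" is a hypothetical local file.
import time
import pyarrow.parquet as pq

path = "data.parquet"  # substitute your own file

t0 = time.perf_counter()
meta = pq.read_metadata(path)          # parses the Thrift footer only
t1 = time.perf_counter()
table = pq.read_table(path)            # decompresses and decodes into Arrow
t2 = time.perf_counter()

print(f"footer parse: {(t1 - t0) * 1e3:.1f} ms "
      f"({meta.num_columns} columns, {meta.num_row_groups} row groups)")
print(f"full decode into Arrow: {t2 - t1:.2f} s for {table.num_rows} rows")
```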
At the beginning, we tried to make it a collective effort so as not to divide the community; we wanted to build a format that everyone agrees on. So we actually gathered folks from a lot of different teams. At that time, we had meetings with, for example, Meta's Nimble team and folks from CWI, and there were also folks from NVIDIA and SpiralDB. We thought that if we gathered all the related people, all the experts in this area, and somehow reached a consensus, it would then be the next generation format, right, because we had all the brilliant minds in the room. But eventually, that didn't work, for a lot of reasons.
For example, one reason is that if there are too many people in the room, then there will just be too many different opinions, and the progress toward a consensus will be slow. It's kind of hard to reach a true consensus among so many people. There were also some legal issues around copyright, but I think those were minor. So in the end, we ended up building our own format called F3. That's the whole story.
[00:07:31] Tobias Macey:
So you brought up some of the formats such as ORC and Parquet, which have been very widely adopted, but I'm wondering if you can talk to some of the limitations in their design and implementation that necessitate the continued creation of newer file formats, and some of the problems that you're trying to address with the work that you're doing on F3.
[00:07:55] Xinyu Zeng:
Well, I think there are, in general, two driving forces behind this wave of new file formats. One is hardware performance, and the other is the changing workload access patterns. So let's talk about the hardware first. We can see that the performance of storage and networks has improved by orders of magnitude over the last decade, right? But compute performance has not. Single CPU core performance has stagnated for a long time, while storage just gets cheaper and cheaper and also faster. So what does that mean for file formats?
The first thing is that many previous assumptions simply do not hold anymore. For example, in the old Hadoop era, when all your storage was hard drives connected by slow networks, what you wanted when designing a file format was a better compression ratio and smaller files, because that typically meant better query performance. The reason is that the bottleneck of the whole query stack was on IO, and smaller files save IO bandwidth. Right now, things are different. You can no longer optimize only for smaller files and better compression ratios. You also have to think about the overhead of decompression, like how much computational cost you have to pay when you are reading the data back from the file, because now the CPU is the bottleneck instead of the IO.
Therefore, the encodings and compression inside the file format have to change. Basically, they have to be lightweight. By lightweight, I mean that decoding and decompression cannot consume so many CPU cycles that they become the bottleneck. And the second thing related to the hardware trend is that because single core performance is stagnating, we have to make the decoding process of the file format parallel. By parallel, I mean we have to make good use of the parallel features of the hardware, for example SIMD instructions on the CPU and SIMT on the GPU.
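To illustrate what "lightweight" means in practice, here is a toy sketch of delta encoding with NumPy. The vectorized operations stand in for the SIMD-friendly kernels a real format would use; this is not F3's actual encoding set.

```python
# Toy illustration of a lightweight, vectorizable encoding: delta encoding.
# NumPy's vectorized ops stand in for the SIMD kernels a real format would use;
# this is not F3's actual encoding set.
import numpy as np

values = np.sort(np.random.randint(0, 1_000_000, size=1_000_000)).astype(np.int64)

# Encode: store deltas (the first delta carries the first value), which need
# far fewer bits than the raw values for sorted or slowly changing data.
deltas = np.diff(values, prepend=0)

# Decode: a single cumulative sum restores the column, cheap on the CPU
# compared to heavyweight general-purpose compression.
decoded = np.cumsum(deltas)
assert np.array_equal(decoded, values)
print("max delta:", deltas[1:].max(), "vs max value:", values.max())
```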
There is actually a lot of recent research work in academia optimizing toward that goal. It's great work, and we are already seeing a lot of industry formats adopting that new research. Okay, so that's the hardware part. And as I mentioned, there is a second driving force for new file formats, which is the workload change. So let's first think about what kind of workload we expect when we are using Parquet. The data is mostly a two dimensional table without too many columns. For example, imagine we are storing the lineitem table from TPC-H.
That's a very typical OLAP workload. What we are going to do with the lineitem table stored in a Parquet file is some batch oriented analytics. For example, we want to do a full scan of some columns, we may want to do an aggregation on a column, a sum or an average, and we may also want to apply a filter to get rid of unwanted rows. That's all. But right now, with the new machine learning workloads getting popular, we see that when people are doing feature engineering, for example, there will be tables with thousands of columns.
Each column is just a feature added by the feature engineers, and a query only wants to read very few of those columns. We call this pattern wide-table projection, and Parquet is not optimized for it. You always have to parse the entire file footer, which contains the metadata for those thousands of columns, even though you only want to read one column. So the underlying issue is that although Parquet is a columnar format, its metadata is not. Its metadata is encoded using a protocol called Thrift, which you can think of as something similar to Protocol Buffers.
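To make the wide-table projection problem concrete, here is a small sketch (the file name and column names are made up) that writes a Parquet file with a couple of thousand columns and shows that the entire Thrift footer is present, and must be decoded, even when only one column is projected:

```python
# Sketch of the wide-table projection problem with Parquet (hypothetical file/columns).
import pyarrow as pa
import pyarrow.parquet as pq

num_cols = 2000
table = pa.table({f"feature_{i}": [1.0, 2.0, 3.0] for i in range(num_cols)})
pq.write_table(table, "wide.parquet")

meta = pq.read_metadata("wide.parquet")
print("footer (Thrift) size:", meta.serialized_size, "bytes",
      "for", meta.num_columns, "columns")

# Even a single-column projection has to decode that whole footer first.
one_col = pq.read_table("wide.parquet", columns=["feature_0"])
print(one_col.num_columns, "column read, but the full footer was parsed")
```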
So that's the underlying issue with Parquet's metadata, and it is actually causing a lot of problems for real machine learning workloads; that's what I called wide-table projection. And there is another workload access pattern related to machine learning, which is random access. In my opinion, the random access pattern comes from two sources in general. The first one is training, and the other one is serving. Let's talk about training first. For machine learning training, we ideally want the data to be fully random. We want the data fed into the training pipeline to not be clustered in any particular way, so that the model training process can be smooth and produce a good training outcome.
In that case, if we don't pre-shuffle the data, which is a very costly operation, we'd prefer to randomly access the data. We want to, for example, pre-generate a permutation of the row IDs of all the data. If we have 10,000,000 rows, we just generate a permutation of 10,000,000 row IDs and use those IDs to fetch the data randomly. That is pure random access to the underlying data. That's for training. The second source of the random access pattern is machine learning serving, or model serving, where the serving engine first does a vector search and then gets a top-k result.
Using that top-k result, the engine will typically fetch the original data by the top-k row IDs and return it to the user. That top-k access is also random access. And as I just mentioned, formats like Parquet are not designed for such a random access pattern. When you do random access on Parquet, the performance suffers from read amplification on both IO and computation, because Parquet's layout and encoding algorithms are simply not designed to optimize random access.
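Here is a sketch of the training-time access pattern being described, assuming an installed pyarrow and a local Parquet file at a hypothetical path: pre-generate a permutation of row IDs and fetch rows by ID, which also shows the read amplification, since a whole row group is decoded just to return a single row.

```python
# Sketch of random access for ML training over a Parquet file (hypothetical path).
# Shows the permutation-of-row-IDs pattern and why Parquet suffers read
# amplification: fetching one row still decodes an entire row group.
import numpy as np
import pyarrow.parquet as pq

pf = pq.ParquetFile("train.parquet")
num_rows = pf.metadata.num_rows

# Pre-generate a random permutation of all row IDs (the "shuffle").
perm = np.random.permutation(num_rows)

# Map global row IDs to (row group index, offset within that row group).
rg_sizes = [pf.metadata.row_group(i).num_rows for i in range(pf.num_row_groups)]
rg_starts = np.cumsum([0] + rg_sizes)

def fetch_row(row_id):
    rg = int(np.searchsorted(rg_starts, row_id, side="right")) - 1
    offset = row_id - rg_starts[rg]
    # Decodes the whole row group just to return one row.
    return pf.read_row_group(rg).slice(offset, 1)

for row_id in perm[:8]:          # feed the first few shuffled rows
    batch = fetch_row(int(row_id))
```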
I have talked a lot, so just to summarize and wrap up: there are many changes brought by both hardware and workloads, including faster storage, the so-called wide-table projection, and the random access pattern. A lot of things inside these file formats need to change, including the metadata, the data layout, the encodings, the compression, etcetera. That's why I think we see more and more new formats in recent years, and that's why older formats like Parquet need to change.
[00:15:49] Tobias Macey:
Another interesting aspect of the ecosystem of file formats is their integration with, or support by, some of the various table formats as well, with Iceberg being one of the more notable ones, but even recent work like DuckLake supports Parquet and some of these other existing file formats. And then there's the Lance project that has both the file format and the table format. I'm wondering if you can talk to some of the ways that you see the role of these file formats in conjunction with the table formats in this broader ecosystem of being able to evolve with the new requirements around data types and data applications.
[00:16:37] Xinyu Zeng:
So you mentioned two terms here: one is table format and the other is file format. I think we should first go through the story of those two terms. At the very beginning, there was only the file format, and the data lake was just you putting a lot of Parquet files into object storage like S3; that's what people originally called a data lake. Then when people were doing this, they found it doesn't work, because you immediately want multi-version management and you want to do transactions on the data stored in object storage.
So people wanted to bring some very old database techniques and terminology back from the database world to the data lake world. What they did was add another metadata layer on top of the raw Parquet files stored in S3. This metadata layer can help you do ACID transactions, can help you do multi-versioning, and can help you evolve the table without confusion or issues, for example older versions getting lost, or multiple writers writing to the same table and everything getting messed up. I think that's the original motivation for the table format to come in. And the next question is, do the table format and the file format have to be designed in a coupled way or a decoupled way? Originally, when Iceberg was first designed, it was very coupled with Parquet.
Right now, I think there is still tightly coupled code inside the Iceberg Java repository that is there to optimize for Parquet. But people are tending to decouple those two layers, because file formats have bloomed. Previously we had Parquet, and now we have Lance, we have Nimble, we have F3. And the table format layers want different abilities from those file formats. For example, we can see that Iceberg has an ongoing file format proposal to decouple the file format layer from its code base so that it can better support other file formats like Lance. I think in general decoupling is the better way, because it separates the different functionalities of the table format and the file format and allows more combinations between them.
[00:19:34] Tobias Macey:
And so digging into the F3 project itself, what are some of the key design principles and key ideas in it that allow for this more evolutionary and extensible approach, versus some of the limiting aspects of formats such as Parquet and ORC?
[00:19:56] Xinyu Zeng:
For this question, I will first discuss some of Parquet's problems, or why Parquet doesn't evolve the way we wish it would. I think the first reason is that there are just too many different implementations of Parquet. I remember once counting how many Parquet implementations are out there, and I believe there are more than 10 or even 20. There are Parquet Java, Parquet C++, Parquet Rust, in different languages, and there are also a bunch of proprietary implementations of Parquet in DuckDB, in Snowflake, and in other systems we don't know about. So the result is that it is really hard to align the behavior of all those implementations.
It is possible that one implementation supports, let's say, feature A while another does not, and that the other one supports feature B while the first does not. The consequence of having so many implementations is that people are afraid of using new features of the format. The format spec itself is evolving, and we can see Parquet adding new features year by year, but people just tend to use the most basic features, which we call the lowest common denominator. Therefore, the format is really hard to evolve. It takes a long time to align all the implementations on a new feature so that nobody is afraid to use it.
I think that's the problem with Parquet. And regarding how F3 solves these challenges, our idea is that, first, you should make the file format very extensible in general. That's the high level goal. To break that goal down, we designed the format along two aspects. The first is that we try to make the layout as flexible and extensible as possible. The second is that we embed custom WebAssembly code to make the file format self-decoding. The second one, I think, is a very promising idea, at least to me. Let's go through them one by one. So the first one is that, as I said, we try to make the layout as flexible as possible. What does layout mean?
In Parquet, you can imagine a block of data. In a file format, you have techniques like encodings and compression to encode this block of data. Let's say this block is 64,000 rows; when you encode and compress it, it becomes a smaller, compressed block. Imagine you have a two dimensional table of data, and each column is split into those blocks. The layout is just how you organize those data blocks inside your file. What Parquet does is use something called PAX, Partition Attributes Across. It's a term from around 2000, but the more generally known term is row groups.
You partition the file into row groups, and each row group contains, for example, a fixed number of rows. Inside a row group, the file is purely columnar, which means the data of one column is stored physically together. That is the layout of Parquet. We found this layout is not that flexible or extensible, because it binds too many concepts together. For example, the size of the row group also controls the size of the IO unit, and the size of the row group also controls the scope of the dictionary encoding inside each column. So what does that mean? It means you have one parameter that controls three things, and that is not good when you are trying to extend the layout or tune the parameters, because when you change one thing, you affect three things.
So what we do is decouple those three concepts. Each of them has its own parameter and can be tuned on its own. We also make the layout as flexible as possible. For example, previously Parquet's dictionary encoding only worked within one row group, which means a dictionary could only cover its local row group. What we do is expand the scope of that dictionary up to the whole file. Now you can have a dictionary that covers the whole file, or maybe half of the file; it's up to you. So that's our effort to make the layout as flexible as possible.
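To make the "one parameter controls three things" point concrete, here is a purely hypothetical sketch of a layout description in which the IO unit, the encoding block, and the dictionary scope are independent knobs. These structures are illustrative only and do not reflect F3's actual on-disk metadata.

```python
# Hypothetical sketch of a decoupled layout description; these dataclasses are
# illustrative only and do not reflect F3's actual structures.
from dataclasses import dataclass, field

@dataclass
class ParquetStyleLayout:
    # One knob controls the IO unit, the columnar partition, and the
    # dictionary scope all at once.
    row_group_rows: int = 65_536

@dataclass
class DecoupledLayout:
    io_unit_bytes: int = 8 * 1024 * 1024      # how much is fetched per read
    encoding_block_rows: int = 65_536          # unit that gets encoded/compressed
    # Dictionary scope per column, expressed as "how many encoding blocks
    # share one dictionary" (one block, half the file, the whole file, ...).
    dict_scope_blocks: dict = field(default_factory=dict)

layout = DecoupledLayout(dict_scope_blocks={"user_id": 1, "country": 1_000})
print(layout)
```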
The second thing is that we want to make the encoding extensible. As I just mentioned, inside each data block you need some encoding and compression to make the block as small as possible. But we found that in Parquet, although the format has added some new encodings over the past few years, and in academia there is a lot of research proposing new encodings, none of those new encodings are actually getting used by users today. So that's the problem, and we were thinking about how to solve it in a way that is forward compatible. Then we came up with the idea that we can actually make the file format self-decoding.
Right now in Parquet, you can imagine the file is roughly divided into two parts. One part is the data; the other part is the metadata. The metadata basically describes the locations of the data for the different row groups and columns, along with some basic statistics like min/max zone maps. But now we want to add a third part: the algorithm, or the code, to decode the data part. So what's the benefit of adding this third part? Imagine you just invented a brilliant encoding algorithm that is better than every existing one, or an algorithm that is only good for your own data: it benefits your data, but nobody else would use it.
In Parquet, if you want to add a new encoding, you first have to convince the whole community to accept that encoding into the spec, and then you have to align all the different implementations I just mentioned, in different languages, plus the proprietary implementations, and make sure all of them support the new encoding so that everyone can accept it. This is a very long process, and it's really hard to achieve. But when you can embed your own algorithm, you can just put your brilliant new encoding inside the file, and as long as you ensure the decoding API is stable, everyone can use that algorithm to get the original data back from the file.
So that's the whole idea. Regarding the implementation, we chose WebAssembly to store the decoding algorithm I just mentioned. Just for background, WebAssembly is a binary instruction format for a stack based virtual machine; that's the official definition. In my opinion, the initial goal of WASM was to address the performance problem of JavaScript in the browser, to give programs running in the browser close to native performance. That was its goal, but it actually comes with many benefits that align with file formats. For example, it is very portable.
Imagine you are putting some algorithm or code inside a file format. You don't know what kind of architecture or operating system the reader will be running when it decodes that file. You cannot make any assumptions about that, which means you cannot compile the algorithm into an executable or shared library and just put that into the file. You have to use some virtual machine instruction format, and WASM simply fits that role because it was originally designed for the browser, where it has to run on any hardware and any operating system. So it's just a natural fit. The second thing is that, because WASM is optimized for the browser, the binary size of the code has to be very small.
Imagine that when you open a website, the first thing you do is download some WASM binaries. If the binaries are very large, you will have a very bad user experience in the initial loading time, the initial loading latency, of the website. So the designers of the WebAssembly instruction format tried to optimize the binary size as much as possible. This also aligns with the design of a file format very well. When you embed some code or algorithm into the file, you add something extra and the file itself gets bigger, but the overhead is just the binary size, and in our experiments we found that for some common encoding algorithms the size is just kilobytes, up to hundreds of kilobytes. It is almost negligible.
And I think the third way WebAssembly fits the design of a file format is that it's actually very efficient. Although it runs in a virtual machine, it's not like Python, or Java running in the JVM; it has very high performance, close to native code. In our experiments and measurements, we found the performance overhead of running WASM compared to native code is around 20%, and we think that's acceptable given the extensibility I just mentioned. So that's basically the whole story of Parquet's extensibility problem and how we solve it through the layout and the extensible, self-decoding design built on WASM.
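To make the self-decoding idea tangible, here is a toy sketch using the wasmtime Python bindings: a trivial "decode" kernel shipped as WASM and invoked through a stable export name. A real kernel would exchange data buffers through WASM linear memory; this is not F3's actual kernel ABI.

```python
# Toy sketch of the self-decoding idea using the wasmtime Python bindings.
# A real kernel would pass data buffers through WASM linear memory; this just
# shows a kernel shipped alongside the data and called via a stable export name.
from wasmtime import Engine, Store, Module, Instance, wat2wasm

# Imagine these bytes were read out of the file's kernel section. Here the
# kernel is a trivial frame-of-reference decoder: value = base + delta.
kernel_wat = """
(module
  (func (export "decode") (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))
"""
kernel_bytes = wat2wasm(kernel_wat)

engine = Engine()
store = Store(engine)
module = Module(engine, kernel_bytes)
instance = Instance(store, module, [])       # sandboxed, no host imports
decode = instance.exports(store)["decode"]

base, delta = 1_000, 42                      # pretend these came from the data section
print("decoded value:", decode(store, base, delta))   # -> 1042
```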
[00:32:17] Tobias Macey:
And then the other piece of introducing any new technology is the question of adoption and integration into the broader ecosystem. And with a format such as F3, I understand that there's definitely a lot of value in the research element, but in terms of solving the problem for the broader ecosystem, there needs to be some adoption of either the F3 implementation that you've already done or something built on top of those same principles. And I'm curious how some of that extensibility allows for an ease of adoption, and maybe implementing some form of compatibility with existing formats such as Parquet to be able to act as a drop in replacement while benefiting from those extensible capabilities, and then being able to incorporate some of the benefits of F3 into the various compute engines that would actually be interacting with it?
[00:33:18] Xinyu Zeng:
Well, that's a great question. I think, first, the self-decoding ability I just mentioned is actually a very attractive feature and a strength for F3 when it comes to being integrated into the broader ecosystem, because once it is integrated, it's much easier than other formats to evolve. For other formats, when you propose a new encoding, you have to go through the process of submitting a pull request on GitHub, getting it merged, and then making sure all the readers upgrade to the newer version so that they can use the new feature. But in F3, we have this forward-compatible ability where even old readers can adopt new features, new encodings.
So I think that's a very attractive feature for getting F3 adopted. But regarding how to get F3 adopted in the first place, I think the first thing is that you have to connect to the Arrow ecosystem. Nowadays, Apache Arrow is, I think, the de facto standard for transferring data across API boundaries or across different systems, whether over the network or via interprocess communication. So in our design, we actually have native support for Arrow: all the decoded data is produced in Arrow format, and because we are a research project, our type system is even entirely based on Arrow.
So I think for a file format in general to be adopted faster in the ecosystem, connecting to Arrow is a must. The second thing is that you have to make the API easy to use. For users seeing the file format library for the first time, it cannot be too hard to use; they shouldn't have to go through a very tedious process of setting up the environment, porting the library into their code base, etcetera. So just make the API easy to use. And the last thing is that nowadays we have this concept of composable query systems, driven by query engines like DataFusion, DuckDB, and Velox.
So for a file format, you should try to integrate with one of those three: DataFusion, DuckDB, or Velox. And actually, all of those systems have a connector to Arrow. So if you are connecting to Arrow everywhere, you can connect into those composable data systems, and then your format will be used by more and more users. There's actually a great example of a file format doing this, which is Vortex. Vortex is doing very well on this front: it has native support for DataFusion and DuckDB, and I believe it will gain more users because of that.
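As a small illustration of why decoding into Arrow is the adoption shortcut: once data is an Arrow table in memory, a composable engine can query it directly. The sketch below uses DuckDB's Python API, which can scan in-scope pyarrow objects by name; the table contents here are fabricated.

```python
# Sketch: once decoded data is an Arrow table, composable engines can use it
# directly. DuckDB's Python API can scan a pyarrow.Table referenced by name.
import duckdb
import pyarrow as pa

# Pretend this table came out of an F3 (or any other) reader.
decoded = pa.table({"user_id": [1, 2, 3, 2], "amount": [9.5, 3.0, 7.25, 1.5]})

result = duckdb.sql(
    "SELECT user_id, sum(amount) AS total FROM decoded GROUP BY user_id ORDER BY user_id"
)
print(result.fetchall())
```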
[00:37:00] Tobias Macey:
And in your work of doing this research and building the implementation of F3 as that proof of concept, I'm wondering what are some of the engineering challenges that you faced while doing that development and research?
[00:37:17] Xinyu Zeng:
Well, I think the first challenge that comes to mind is that, as I mentioned, we try to make the file format as flexible and extensible as possible. But the downside is that if it is too flexible, you have too many things to tune, too many knobs, and you don't know what the best defaults are. For example, in our file format we have this concept of a flexible dictionary scope, whereas in something like Parquet, each row group has one dictionary.
So each column in the row group has one dictionary. We allow the dictionary to be arbitrarily scoped, which means, for example, that for column one you can have two data blocks sharing one dictionary, and for column two you can have four data blocks sharing one dictionary. That leaves an interesting question for the file writer: what is the best dictionary scope for all those columns in your table? How should you optimize this scope? We actually did some research on this, and we found it's not a trivial problem.
There's a trade-off between your writing speed and the final compression ratio of your file. That's the first engineering challenge I can think of. And I think the second challenge is that, because we are using WebAssembly to store the decoding algorithm, we have done a fair amount of engineering with WebAssembly, and we found WASM is actually not that stable, in the sense that the specification itself is still evolving. For example, there's a recent WASM64 proposal that tries to expand the memory capacity of the WASM runtime, and there's also a threads proposal that tries to add multithreading support inside the WASM runtime.
And there are just many different kinds of WASM tooling. Some of it is in Rust, some is in C++, and I even found some that is abandoned and no longer maintained. Because the community is still very new and the technology is evolving fast, we do see some challenges in using WASM and integrating it into the file format. I think those are the two main challenges I can think of.
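To make the dictionary-scope tuning problem concrete, here is a rough sketch that compares the total dictionary footprint of per-block dictionaries versus a single file-wide dictionary for a skewed string column. The numbers and cost model are illustrative only, not F3's.

```python
# Rough sketch of the dictionary-scope trade-off (illustrative, not F3's cost model):
# per-block dictionaries duplicate common values, a file-wide dictionary does not,
# but the writer must then build and hold one dictionary across the whole file.
import random

random.seed(0)
values = [f"city_{random.randint(0, 99)}" for _ in range(100_000)]  # skewed column
block_rows = 10_000
blocks = [values[i:i + block_rows] for i in range(0, len(values), block_rows)]

def dict_bytes(vals):
    return sum(len(v) for v in set(vals))

per_block = sum(dict_bytes(b) for b in blocks)        # one dictionary per block
file_wide = dict_bytes(values)                        # one dictionary for the file

print(f"per-block dictionaries: {per_block} bytes across {len(blocks)} blocks")
print(f"file-wide dictionary:   {file_wide} bytes (smaller, but built in one writer pass)")
```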
[00:40:09] Tobias Macey:
And then another aspect of being able to customize the encodings and the handling of some of the data in the columns is that it introduces the opportunity to support new data formats, where vector representations have been one of the most visible changes in terms of common use cases for the types of data that we want to store. But I'm wondering if there are any other examples of interesting or niche data types that somebody would want to be able to customize and incorporate into the file format and support using these WASM extensions.
[00:40:52] Xinyu Zeng:
Yeah. I mean, vector embeddings are certainly a key data type to support here, because in recent years, again with the emergence of AI, there is a need for so-called multimodal data. And we can see that traditional formats like Parquet do not support that wide range of data types very well; they primarily work with traditional OLAP data. Besides vectors, there are also other data types like images, audio, and video. Those data types actually impose a challenge on the file format if you really want to store them in place. First, they are much larger than the traditional ones. An integer is just eight bytes, and strings are maybe tens of bytes, but an image is on the level of megabytes, and a video is hundreds of megabytes or even gigabytes.
So when you put them into a single file, you have to make sure data of such widely varying sizes does not interfere with each other. When you store those images, videos, and audio, you are also going to store some metadata: descriptions, feature flags that your model derived from those images and videos. Those features, flags, and metadata are just traditional data, integers or strings or floats, but you want to store them together. Then comes the problem of how to align the small data with the large data, and how you should partition the file so that when you query those two types of data together, your query performance does not suffer.
And also, when you store those two types of data together, you still want a good compression ratio. That's the first question. I think you also mentioned Lance before; Lance is a pioneer in this area, and from the beginning it incorporated those two types of data together inside one format. The second challenge is that we should think about what kinds of changes we need to make to the API for accessing the file format when we store multimodal data inside the file. For example, when you are accessing video, sometimes the reader only wants a slice of the video or some key frames from it.
If you are still using Parquet, for example, the only thing you can do is store the video as a large blob data type in a column. Without an API for that kind of slicing or frame extraction, readers just don't have that ability, and then people will not use the file format for that data. So we should also think about what kinds of API changes we should make when we are storing those new data types.
[00:44:16] Tobias Macey:
And so bringing it back up to the level of data lakes and end user use cases, when you do start incorporating F3 as the storage layer, what are some of the interesting capabilities that it unlocks for data lakes, whether that's in terms of the table formats, the compute engines, or the types of workloads that it makes more straightforward, maybe for more AI focused use cases? I'm just wondering how you're thinking about the ripple effects that a format such as F3 and its evolutionary functionality introduces to the broader ecosystem.
[00:44:56] Xinyu Zeng:
Well, I think one interesting point when we think about combining F3 with a data lake is that we can actually move the WASM from the file layer to the table layer, or the data lake layer. I mentioned earlier that in F3 we store the decoding code as a WASM binary inside the file. But if we combine the file format layer with the data lake layer, or table format layer, we can reduce the redundant storage of WebAssembly code in each file. Imagine you don't have the ability to combine the two systems: say we have 10 files, and maybe three of them share one encoding, a new encoding you just proposed, and each of them needs to store a separate copy of the WebAssembly code to make it self-contained.
Now, if you have a table layer on top of it, the table layer can manage the files together, so the file itself does not need to store the WebAssembly code. You can store the WebAssembly code in the data lake's metadata layer, or have some separate files that centralize the WASM code. I think that's one interesting point about combining them. The downside is that the file is no longer self-contained, because the WASM decoding algorithm is not stored inside the file, so all the read paths have to go through the data lake layer instead of going directly to the file. The second thing is that if we have a data lake layer managing those files and those WASM binaries, another interesting thing we can do is verify the binaries to make them secure. One interesting discussion point I saw on the internet regarding F3 is that people question the security of embedding WASM binaries inside a file, because it is essentially arbitrary code that is going to be executed by your reader. WASM has a sandbox mechanism to make sure the code will not do anything harmful to your system, but in general people are wary of storing code in a file like this.
So with this data lake layer on top of F3, and this is actually in our proposal for future work, imagine you have a centralized repository storing all the verified WASM. All the WASM binaries, those decoding kernels, are verified by a central authority to ensure they are not harmful. You can even calculate the hash or checksum of those kernels and compare the checksums inside your own file to those in the central repository to make sure the file is not harmful. And with the data lake layer, I think this can be managed better, because it has a metadata layer that manages all those files.
And when files are ingested into the lake, it can immediately verify that a file is not harmful, ensuring that the embedded WASM code will not break out of the reader or do anything bad to the overall system.
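Here is a sketch of the verification idea being described, with made-up digests and a hypothetical reader helper: hash the kernel bytes embedded in a file and check them against an allow-list of kernels vetted by a central repository.

```python
# Sketch of verifying embedded WASM kernels against a central allow-list.
# The allow-list contents and the extract_kernels() helper are hypothetical.
import hashlib

# Checksums of decoding kernels that a central repository has vetted.
VERIFIED_KERNELS = {
    "3b9a...": "delta-v1",      # placeholder digests for illustration
    "7f2c...": "fsst-v2",
}

def verify_file(kernel_blobs: list) -> bool:
    """Reject ingestion if any embedded kernel is not on the allow-list."""
    for blob in kernel_blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in VERIFIED_KERNELS:
            print(f"unverified kernel {digest[:12]}..., rejecting file")
            return False
    return True

# At ingestion time the lake's metadata layer would call something like:
# verify_file(extract_kernels("table/part-0001.f3"))   # hypothetical reader API
```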
[00:49:02] Tobias Macey:
And in your work of building this format, publishing your findings and your implementation, what are some of the most interesting or innovative or unexpected ways that you're seeing people either using or thinking about using the research that you've produced?
[00:49:20] Xinyu Zeng:
I can think of some interesting ideas inspired by F3 that have been discussed either in public or in private conversations. For example, right now we are only using WebAssembly to extend the encoding algorithms of the file format, but the file format has other parts. It has the layout I just mentioned: how you partition the file and organize those data blocks. And there is indexing and filtering: the file format is typically coupled with some very basic zone maps and filtering structures like Bloom filters. Some people are thinking about whether we can use WebAssembly to extend the functionality of those parts as well.
Yeah, those are just ideas people are discussing. I haven't seen any real system try to pursue them yet, but in general they are interesting.
[00:50:29] Tobias Macey:
And in the process of the work that you were doing that resulted in this research output, what are some of the other ideas or hypotheses that you were working through that ultimately got discarded on your path to developing F3?
[00:50:46] Xinyu Zeng:
Well, I think one idea I developed and then discarded is about indexes. If you look at the last section of our very first paper, the VLDB paper, An Empirical Evaluation of Columnar Storage Formats, there is a section that summarizes the lessons learned during our evaluation of Parquet and ORC. I remember one of the lessons was that Parquet's indexes are very, very simple, and that we should incorporate more sophisticated indexing and filtering techniques into the file format. We followed that lesson and developed some ideas for optimizing the indexing part of Parquet, trying to develop some new indexing structures.
I think that area in the academic literature is called scan indexes, or it's a type of secondary index, but in general you try to accelerate scanning the data along with a filter. So we did spend some time optimizing that part, but in the end we realized that the index should be decoupled from the data, from the file. This actually became another paper of ours, a CIDR paper called Towards Functional Decomposition of File Format, which basically states that you should decouple the index from your data, and that the index should be tuned independently instead of being coupled with the data in the file. For example, in Parquet, the zone map and Bloom filter are bundled with the row group: each row group has a zone map and a Bloom filter.
And I feel that's not the right way, because the scope, or block size, of the Bloom filter and the zone map should be based on the data distribution and the query pattern, instead of being tied so closely to the file format's details. So that's the whole story of our idea on indexing: we developed it first and then we discarded it.
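Here is a sketch of the "index decoupled from the file" idea: a standalone zone map whose block size is chosen from the data and query pattern, used to prune blocks before touching the file at all. This is illustrative only and not the design from that paper.

```python
# Sketch of a zone map maintained outside the file, with its own block size
# (illustrative only; not the actual design from the decomposition paper).
import numpy as np

values = np.random.randint(0, 1_000_000, size=1_000_000)   # stand-in for a column on disk
block_rows = 50_000                                          # tuned to data/query, not to row groups

starts = np.arange(0, len(values), block_rows)
zone_map = [(int(values[s:s + block_rows].min()),
             int(values[s:s + block_rows].max())) for s in starts]

def candidate_blocks(lo, hi):
    """Return block indices whose [min, max] range can contain rows in [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]

blocks = candidate_blocks(10_000, 10_500)
print(f"scan {len(blocks)} of {len(zone_map)} blocks for this predicate")
```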
[00:53:12] Tobias Macey:
And in your work of doing this research and exploring this overall problem space, what are some of the most interesting or unexpected or challenging lessons that you learned personally in the process?
[00:53:24] Xinyu Zeng:
Well, I'd approach this question from a researcher's perspective, or from a PhD student's perspective. I feel like most database research today is about optimizing performance. That's fine, because we are doing systems work and pursuing better performance. But one nice thing I learned from the F3 project is that the most interesting or most practical real world problems are not always about performance. For example, the idea of using WebAssembly to extend the format, to make it more interoperable with others and as extensible as possible, is not all about performance.
I mean, in some ways it is about performance, because when you extend your format with a better encoding, you get a better algorithm and your performance should improve. But it's also solving another critical problem we see in the community, which is that Parquet is not evolving, and it's really difficult to keep all the different implementations aligned. That is a very practical observation from the community. So to summarize, the lesson is that as researchers we can look at more practical problems in the real world, in the open source community or in real-world software systems, and not focus only on optimizing the performance of an algorithm or technique.
[00:55:09] Tobias Macey:
And as you continue your research and your PhD program, what do you have planned for the future of F3? Is that something that you intend to continue evolving or contribute to the community, or is this largely an artifact that supports the research but will ultimately require some new implementation or effort on the part of someone else?
[00:55:33] Xinyu Zeng:
So I will answer this question in two parts. The first is the overall research direction: right now we are focusing on studying multimodal data, and especially the AI use cases for that multimodal data. We are diving in to see what changes the file format or table format layer needs in order to support the new data types and the new AI training and inference workloads. That's the direction, or future plan, for F3 in general. The second part is about the F3 project itself. Right now it is still a research prototype, because we are a small group of researchers and we don't have the engineering capacity to make it a fully featured, industrial-strength file format.
But we do have plans to make it stronger and to get it adopted by more people, and we are certainly going to put more people on it to do the engineering work to make it well equipped.
[00:56:47] Tobias Macey:
Are there any other aspects of your work on the F3 project, the research involved, the impact that it has on the ecosystem, or any other related topics that we didn't discuss yet that you would like to cover before we close out the show? I think I basically covered everything. Alright. Well, for anybody who wants to follow along with the work that you're doing and get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'm interested in your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:21] Xinyu Zeng:
Well, I think the biggest gap is that when people are designing software or technology, they should really try to make it future proof, as the name of the F3 project suggests. It has to be forward looking, because when we are designing software, we don't want it to only fit the current workload or the current hardware; we want it to be usable in the future and survive for a long time. Yeah. I think that's a big lesson I learned from this project, and I also think that's a gap for many technologies.
[00:57:57] Tobias Macey:
Yeah. All right. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the F3 project and the research, and helping to move the ecosystem forward and be thinking about more evolutionary and continually evolvable file formats. I'm definitely very excited to see how that starts to percolate through the overall ecosystem and speed up the rate at which we can innovate and adopt these new capabilities. So thank you for all of the time and effort you've put into that, and I hope you enjoy the rest of your day.
[00:58:35] Xinyu Zeng:
Thank you, Tobias.
[00:58:44] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and guest background
Path into databases and early research
Origins of the Future Proof File Format (F3)
Why new formats beyond Parquet and ORC
Workload shifts: wide tables and random access for ML
File formats vs. table formats and decoupling
Design goals for F3: extensible layout and self-decoding
Revisiting layout: row groups, dictionaries, and decoupling knobs
Self-decoding with embedded WebAssembly
Adoption strategy: Arrow, APIs, and connectors
Engineering challenges: tuning flexibility and WASM maturity
Supporting multimodal data: vectors, images, audio, video
F3 in data lakes: centralizing WASM and security verification
Community ideas and discarded directions (indexes)
Research lessons beyond performance
Future plans for F3 and closing reflections